LossTensor is inf or nan while training ssd_inception_v2 model in my own dataset. #1881
Comments
Hi @Wenstery - It's a bit hard to say without knowing more, but typical ways to deal with this are to reduce the learning rate or to increase the batch size. I also recommend using a GPU if possible as it will significantly decrease your turnaround time on these experiments.
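(For concreteness, a hedged sketch of those two changes in terms of the train_config quoted at the bottom of this issue; the exact numbers are illustrative assumptions, not values from the thread:)
train_config: {
  batch_size: 32          # raised from 24, only if it fits in memory
  optimizer {
    rms_prop_optimizer: {
      learning_rate: {
        exponential_decay_learning_rate {
          initial_learning_rate: 0.0004   # reduced from the original 0.004
          decay_steps: 800720
          decay_factor: 0.95
        }
      }
      momentum_optimizer_value: 0.9
      decay: 0.9
      epsilon: 1.0
    }
  }
}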
Hi @jch1 - It also happened when training on a GPU; I'll try to reduce the learning rate.
To provide another data point, I had the same bug on both
Thanks @timisplump - also maybe kick up the batch size to 32 if you can hold it in memory.
@jch1 I'm unfortunately training on images that are too large (504x960) to hold a batch size of 32. I'm currently using a batch size of 12. When I moved to 16, the training worked fine until it tried to store a checkpoint, at which point I got an OOM error. I could downsample the images and might try that later, but for now I'd like to see how the models do with this image size.
@timisplump does reducing the learning rate help?
@cy89 Well, I only started the new task with a lower learning rate about an hour ago, but it hasn't crashed. Last time, it crashed after running for about 10 hours, so I can't give you a great answer yet. Will let you know when I check back on Monday!
I had the same problem. The real reason is a problem in your annotations: if any bounding box value in your annotations is NaN, it will cause this error. You should go back and check your annotations. That's how I solved this problem when training on my own dataset.
@cy89 Sorry for the slow reply. Been busy with other stuff.
Same problem with the R-FCN model. The training process crashed after 573k iterations and reported the same error. @lzkmylz What exactly do you mean by NaN in annotations? NaN is produced by computation; annotations shouldn't contain that kind of number.
Had the same problem. I double checked my annotations and found some bounding boxes outside the image boundaries. Fixing the bounding boxes solved the problem for me.
I'm suffering from the same problem. Anybody got any more experience on this, or any other way that I can try?
I noticed that the optimizer for SSD Inception v2 is RMSPropOptimizer.
I kept having this issue with SSD MobileNet_v1, and after fixing bounding boxes out of range and removing objects that were too small (less than 15% of the image) I have had no issues training so far.
Is this a problem associated with the dataset?
Yeah. This loss error occurs when there's some kind of wrong annotation. I tried with a different dataset and it works fine. It's always good to keep track of the following things:
Do you have any advice on the range of values that bboxes should be within? Curiously, when I apply batch normalization during training, I get no nan/inf but a constant loss :(
I am also having this problem. My images do not have any particular orientation, so I have flipped, flopped, transposed, transversed, and rotated my images by 90, -90, and 180 degrees. It works great on the original dataset, but not on the artificially augmented training set.
@Wenstery, I took another look back over my dataset. I checked the annotations for boxes outside of the boundaries and didn't find any. I am using the faster_rcnn_inception_resnet_v2_atrous_coco model. I do have very small boxes. Oddly enough, I found that when I tried to use the flipped (horizontal), flopped (vertical), transverse (y=x), or rotated 90/180/-90 datasets I got the NaN error, but when I used the transposed dataset it was fine. I noticed that the newest tensorflow models include preprocessing steps for vertical flip and rotation by 90 degrees; there is already a horizontal flip preprocessing step in there. Just waiting for the TensorFlow 1.4 release on conda to try it out!
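(If you would rather use the built-in augmentation than pre-rotate the dataset, a hedged sketch of the relevant options; random_vertical_flip and random_rotation90 exist in newer versions of the Object Detection API preprocessor alongside the existing random_horizontal_flip:)
data_augmentation_options {
  random_horizontal_flip {
  }
}
data_augmentation_options {
  random_vertical_flip {
  }
}
data_augmentation_options {
  random_rotation90 {
  }
}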
This issue appears to be resolved, although it looks like the discussion has to do with something that could be clarified in documentation or contributed as a QoL improvement -- if anyone feels strongly about this, please feel free to put together a PR or file a specific feature request. Thanks!
Same issue. Dataset is checked and sound - no empty BBs, no BBs outside the image. The dataset works fine when fine-tuning ssd_mobilenet_v1. I removed "small" bounding boxes (with width or height less than 20px). Although it is a little annoying to have to remove those boxes for some of my classes... at least the training can run. Thanks all for the help.
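(A minimal sketch of one way to strip such boxes from Pascal VOC XML annotations before regenerating the TFRecords; the directory name and the 20px threshold are assumptions to adjust for your data:)
import glob
import os
import xml.etree.ElementTree as ET

ANNOTATION_DIR = "annotations"   # assumed folder containing the .xml files
MIN_SIZE = 20                    # drop boxes narrower or shorter than this, in pixels

for xml_path in glob.glob(os.path.join(ANNOTATION_DIR, "*.xml")):
    tree = ET.parse(xml_path)
    root = tree.getroot()
    removed = 0
    for obj in root.findall("object"):
        box = obj.find("bndbox")
        width = float(box.find("xmax").text) - float(box.find("xmin").text)
        height = float(box.find("ymax").text) - float(box.find("ymin").text)
        if width < MIN_SIZE or height < MIN_SIZE:
            root.remove(obj)
            removed += 1
    if removed:
        tree.write(xml_path)
        print("%s: removed %d small box(es)" % (xml_path, removed))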
After an investigation, I found that small samples are not the root cause of the crash for my own dataset. (I even tried with very small samples, such as 5x5 pixels, for training; at least in the first 200 steps, no crash happened.)
For reference, here's what you need to verify with your annotations:
Verify with your label_map.pbtxt file:
Verify with your pipeline config:
Python validation script if you want to use it for annotation files. Just set the directory folder and run the script.
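(The validation script itself was not preserved in this thread; below is a minimal sketch of an equivalent checker for Pascal VOC XML annotations. The directory name is an assumption, and it parses coordinates with float() rather than int(), as suggested further down, so values like "23.0" don't break it. It flags missing, NaN, inverted, and out-of-bounds boxes:)
import glob
import math
import os
import xml.etree.ElementTree as ET

ANNOTATION_DIR = "annotations"   # assumed: set this to your .xml folder

for xml_path in glob.glob(os.path.join(ANNOTATION_DIR, "*.xml")):
    root = ET.parse(xml_path).getroot()
    size = root.find("size")
    img_w = float(size.find("width").text)
    img_h = float(size.find("height").text)
    for obj in root.findall("object"):
        box = obj.find("bndbox")
        if box is None:
            print("%s: object has no bndbox element" % xml_path)
            continue
        xmin = float(box.find("xmin").text)
        ymin = float(box.find("ymin").text)
        xmax = float(box.find("xmax").text)
        ymax = float(box.find("ymax").text)
        coords = (xmin, ymin, xmax, ymax)
        if any(math.isnan(c) for c in coords):
            print("%s: NaN coordinate in %s" % (xml_path, str(coords)))
        elif xmin >= xmax or ymin >= ymax:
            print("%s: inverted or empty box %s" % (xml_path, str(coords)))
        elif xmin < 0 or ymin < 0 or xmax > img_w or ymax > img_h:
            print("%s: box %s outside %dx%d image" % (xml_path, str(coords), img_w, img_h))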
I have a question. I want to specify background images in my training data. I have read that background images should not have any ground truth boxes. Then how do I specify the CSV file for background images?
I have run the same code to check my data and it all passes; however, I can't get past 20 steps of training before the CheckNumerics error comes up. Could this be an issue because I'm using the legacy train file instead of the new model_main.py?
If your training works for a few steps and then you get this (Tensor had NaN values) error, the error has nothing to do with GPU memory. This error probably has something to do with your dataset. If you have changed the number of classes, it can cause a problem. I trained my own dataset on top of Xception_65 with 6 classes and it worked pretty well. However, when I changed the annotation to two classes (building and background), I started to get this error. But then I realized that makes total sense because, in the process of reclassifying the training datasets, I had some images that ended up being entirely background class (I had no buildings in those images and all pixels had value 0).
I am getting the same issue with the models/research/object_detection/legacy/train.py with TensorFlow 1.15. Just wondering if there is anything else I should try to fix this issue. UPDATE: Not sure why it worked with the new one though...
@airman00, I ran your script for annotation validation. I got the below error; may I know what it means:
I set the annotation .xml file directory path in the script.
@sainisanjay try float() instead of int()
What will be the issue if my annotated bounding box coordinate with
The problem is in the data augmentation: delete the crop function. I kept only the flip.
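(In terms of the config quoted at the bottom of this issue, that amounts to removing the ssd_random_crop block from data_augmentation_options and leaving only the horizontal flip; a sketch:)
data_augmentation_options {
  random_horizontal_flip {
  }
}
# the ssd_random_crop data_augmentation_options block is removed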
Hi @airman00, I got the below error. Could you help me understand why I am getting it? ......
@pranatitrivedi8 It looks like some of your xml files don't have a bndbox object; that's why it is giving the error.
Using the Google object_detection API.
TensorFlow version 1.2.0
Describe the problem
LossTensor is inf or nan while training ssd_inception_v2 model in my own dataset.
training config:
train_config: {
  batch_size: 24
  optimizer {
    rms_prop_optimizer: {
      learning_rate: {
        exponential_decay_learning_rate {
          initial_learning_rate: 0.004
          decay_steps: 800720
          decay_factor: 0.95
        }
      }
      momentum_optimizer_value: 0.9
      decay: 0.9
      epsilon: 1.0
    }
  }
  #fine_tune_checkpoint: "/opt/user/awens/tf/models-master/object_detection/models/model/pre_trained/ssd/model.ckpt"
  from_detection_checkpoint: true
  data_augmentation_options {
    random_horizontal_flip {
    }
  }
  data_augmentation_options {
    ssd_random_crop {
    }
  }
}
Source code / logs
INFO:tensorflow:global step 3161: loss = 7.0152 (16.166 sec/step)
INFO:tensorflow:global step 3162: loss = 6.2710 (18.039 sec/step)
INFO:tensorflow:global step 3163: loss = 6.5963 (16.896 sec/step)
INFO:tensorflow:global step 3164: loss = 6.6896 (15.880 sec/step)
INFO:tensorflow:global step 3165: loss = 7.1895 (15.575 sec/step)
INFO:tensorflow:Recording summary at step 3165.
INFO:tensorflow:global step 3166: loss = 6.5400 (20.047 sec/step)
INFO:tensorflow:global step 3167: loss = 7.0436 (15.845 sec/step)
INFO:tensorflow:global step 3168: loss = 6.8610 (16.426 sec/step)
INFO:tensorflow:global step 3169: loss = 7.8241 (15.983 sec/step)
INFO:tensorflow:global step 3170: loss = 7.3034 (15.400 sec/step)
INFO:tensorflow:global step 3171: loss = 6.5742 (16.132 sec/step)
INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.InvalidArgumentError'>, LossTensor is inf or nan. : Tensor had NaN values
[[Node: CheckNumerics = CheckNumerics[T=DT_FLOAT, message="LossTensor is inf or nan.", _device="/job:localhost/replica:0/task:0/cpu:0"]]]
Caused by op u'CheckNumerics', defined at:
File "./object_detection/train.py", line 198, in
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "./object_detection/train.py", line 194, in main
worker_job_name, is_chief, FLAGS.train_dir)
File "/opt/python-project/models-master/object_detection/trainer.py", line 221, in train
total_loss = tf.check_numerics(total_loss, 'LossTensor is inf or nan.')
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_array_ops.py", line 415, in check_numerics
message=message, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2506, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1269, in init
self._traceback = _extract_stack()
InvalidArgumentError (see above for traceback): LossTensor is inf or nan. : Tensor had NaN values
[[Node: CheckNumerics = CheckNumerics[T=DT_FLOAT, message="LossTensor is inf or nan.", _device="/job:localhost/replica:0/task:0/cpu:0"]]]