
LossTensor is inf or nan while training ssd_inception_v2 model in my own dataset. #1881

Closed
Wenstery opened this issue Jul 7, 2017 · 34 comments

Comments

@Wenstery commented Jul 7, 2017

Using the Google object_detection API.

TensorFlow version 1.2.0

Describe the problem

LossTensor is inf or nan while training ssd_inception_v2 model in my own dataset.
training config:
train_config: {
  batch_size: 24
  optimizer {
    rms_prop_optimizer: {
      learning_rate: {
        exponential_decay_learning_rate {
          initial_learning_rate: 0.004
          decay_steps: 800720
          decay_factor: 0.95
        }
      }
      momentum_optimizer_value: 0.9
      decay: 0.9
      epsilon: 1.0
    }
  }
  #fine_tune_checkpoint: "/opt/user/awens/tf/models-master/object_detection/models/model/pre_trained/ssd/model.ckpt"
  from_detection_checkpoint: true
  data_augmentation_options {
    random_horizontal_flip {
    }
  }
  data_augmentation_options {
    ssd_random_crop {
    }
  }
}

Source code / logs

INFO:tensorflow:global step 3161: loss = 7.0152 (16.166 sec/step)
INFO:tensorflow:global step 3162: loss = 6.2710 (18.039 sec/step)
INFO:tensorflow:global step 3163: loss = 6.5963 (16.896 sec/step)
INFO:tensorflow:global step 3164: loss = 6.6896 (15.880 sec/step)
INFO:tensorflow:global step 3165: loss = 7.1895 (15.575 sec/step)
INFO:tensorflow:Recording summary at step 3165.
INFO:tensorflow:global step 3166: loss = 6.5400 (20.047 sec/step)
INFO:tensorflow:global step 3167: loss = 7.0436 (15.845 sec/step)
INFO:tensorflow:global step 3168: loss = 6.8610 (16.426 sec/step)
INFO:tensorflow:global step 3169: loss = 7.8241 (15.983 sec/step)
INFO:tensorflow:global step 3170: loss = 7.3034 (15.400 sec/step)
INFO:tensorflow:global step 3171: loss = 6.5742 (16.132 sec/step)
INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.InvalidArgumentError'>, LossTensor is inf or nan. : Tensor had NaN values
[[Node: CheckNumerics = CheckNumericsT=DT_FLOAT, message="LossTensor is inf or nan.", _device="/job:localhost/replica:0/task:0/cpu:0"]]

Caused by op u'CheckNumerics', defined at:
File "./object_detection/train.py", line 198, in
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "./object_detection/train.py", line 194, in main
worker_job_name, is_chief, FLAGS.train_dir)
File "/opt/python-project/models-master/object_detection/trainer.py", line 221, in train
total_loss = tf.check_numerics(total_loss, 'LossTensor is inf or nan.')
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_array_ops.py", line 415, in check_numerics
message=message, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2506, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1269, in init
self._traceback = _extract_stack()

InvalidArgumentError (see above for traceback): LossTensor is inf or nan. : Tensor had NaN values
[[Node: CheckNumerics = CheckNumericsT=DT_FLOAT, message="LossTensor is inf or nan.", _device="/job:localhost/replica:0/task:0/cpu:0"]]

@jch1 (Contributor) commented Jul 7, 2017

Hi @Wenstery - It's a bit hard to say without knowing more, but typical ways to deal with this are to reduce the learning rate or to increase the batch size. I also recommend using a GPU if possible as it will significantly decrease your turnaround time on these experiments.
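For illustration only, lowering the learning rate in the config quoted above means changing just a couple of values; the 0.001 learning rate and batch_size of 32 below are example numbers, not settings recommended elsewhere in this thread:

train_config: {
  batch_size: 32
  optimizer {
    rms_prop_optimizer: {
      learning_rate: {
        exponential_decay_learning_rate {
          initial_learning_rate: 0.001   # reduced from 0.004, purely as an example
          decay_steps: 800720
          decay_factor: 0.95
        }
      }
      momentum_optimizer_value: 0.9
      decay: 0.9
      epsilon: 1.0
    }
  }
}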

@Wenstery (Author) commented Jul 7, 2017

Hi @jch1 - it also happened when training on a GPU. I'll try reducing the learning rate.
Thanks a lot!

@timisplump commented:

To provide another data point, I had the same bug on both ssd models (while the other 3 models were fine). I will try reducing the learning rate and if the problem persists, I will comment back here.

@jch1 (Contributor) commented Jul 7, 2017

Thanks @timisplump - also maybe kick up the batch size to 32 if you can hold it in memory.

@timisplump commented:

@jch1 I'm unfortunately training on images that are too large (504x960) to hold a batch size of 32. I'm currently using a batch size of 12. When I moved to 16, the training worked fine until it tried to store a checkpoint, at which point I got an OOM error.

I could downsample the images and might try that later, but for now I'd like to see how the models do with this image size

@cy89 commented Jul 8, 2017

@timisplump does reducing the learning rate help?

@timisplump commented:

@cy89 Well, I only started the new run with a lower learning rate about an hour ago, and it hasn't crashed yet. Last time it crashed after running for about 10 hours, so I can't give you a definitive answer. Will let you know when I check back on Monday!

@lzkmylz commented Jul 10, 2017

I had the same problem. The real cause was a problem in my annotations: if any bounding box value is NaN in your annotations, it will cause this error. Go back and check your annotations. That's how I solved this problem when training on my own dataset.

@timisplump commented:

@cy89 Sorry for the slow reply; been busy with other stuff.
Lowering the learning rate did not solve the problem. I'm going to see if I can run multi-GPU with a larger batch size and check whether that helps.
I'm pretty convinced it's not an error in my dataset, because the atrous Faster R-CNN was working fine after a few days of training. Moreover, the other models that crashed (all four of them) crashed at different points in the dataset / different epochs. It's a pretty strange error. Hope we can find it.

@lionel92 commented:

Same problem with the R-FCN model. The training process crashed after 573k iterations and reported the same error. @lzkmylz What exactly do you mean by NaN in the annotations? NaN is produced by computation; an annotation file cannot contain that kind of value.

@ph463 commented Jul 27, 2017

Had the same problem. I double checked my annotations and found some bounding boxes outside the image boundaries. Fixing the bounding boxes solved the problem for me.
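For readers who want to run the same check, here is a minimal sketch that flags out-of-bounds boxes in Pascal VOC-style XML annotations. The directory name and tag layout are assumptions; adjust them to your own dataset:

import os
import xml.etree.ElementTree as ET

annotation_dir = "annotations"  # placeholder: point this at your VOC-style XML folder

for name in os.listdir(annotation_dir):
    if not name.endswith(".xml"):
        continue
    tree = ET.parse(os.path.join(annotation_dir, name))
    size = tree.find("size")
    width = int(float(size.find("width").text))
    height = int(float(size.find("height").text))
    for obj in tree.findall("object"):
        box = obj.find("bndbox")
        xmin = float(box.find("xmin").text)
        ymin = float(box.find("ymin").text)
        xmax = float(box.find("xmax").text)
        ymax = float(box.find("ymax").text)
        # Flag any box that extends past the image borders
        if xmin < 0 or ymin < 0 or xmax > width or ymax > height:
            print("%s: box (%s, %s, %s, %s) outside %sx%s image"
                  % (name, xmin, ymin, xmax, ymax, width, height))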

@yeephycho commented:

I'm suffering from the same problem.
I cleaned all my annotations, deleted small objects, and limited the box coordinates to the range 0.005 to 0.995.
But I still get the "Tensor had NaN values" error, and I found that it always happens around step 3000.
I wonder whether it is related to the test or validation stage of the object detection API?
I also tried reducing the learning rate and switching to a different optimizer; the error still occurs.
I have a model that reproduces the error: when fine-tuning from it, the error usually appears within 30 steps, sometimes as early as the second step.

Does anybody have more experience with this, or another approach I can try?

@yeephycho commented:

I noticed that the optimizer for SSD Inception v2 is RMSPropOptimizer.
In the config file (around line 136), the training config is:
train_config: {
  batch_size: 24
  optimizer {
    rms_prop_optimizer: {
      learning_rate: {
        exponential_decay_learning_rate {
          initial_learning_rate: 0.004
          decay_steps: 800720
          decay_factor: 0.95
        }
      }
      momentum_optimizer_value: 0.9
      decay: 0.9
      epsilon: 1.0
    }
  }
epsilon is set to 1.0, but in the official TensorFlow documentation epsilon should be a very small value, to avoid a zero denominator.
Was this parameter intentionally set to 1.0?
Could it be the reason for the NaN error?
@jch1
Thanks!
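For anyone who wants to test that hypothesis, the only change needed is the epsilon value in the optimizer block. The 1e-7 below is purely an example value; whether the released config's 1.0 is intentional is exactly the open question here:

optimizer {
  rms_prop_optimizer: {
    learning_rate: {
      exponential_decay_learning_rate {
        initial_learning_rate: 0.004
        decay_steps: 800720
        decay_factor: 0.95
      }
    }
    momentum_optimizer_value: 0.9
    decay: 0.9
    epsilon: 0.0000001   # example of a small epsilon; the shipped config uses 1.0
  }
}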

@andreabc commented:

I kept having this issue with SSD MobileNet v1. After fixing out-of-range bounding boxes and removing objects that were too small (less than 15% of the image), I have had no issues training so far.
Others had similar issues here: #1907

@shamanez commented:

Is this a problem related to the dataset?

@shamanez commented:

Yeah. This loss error occurs when there is some kind of incorrect annotation. I tried with a different dataset and it works fine. It's always good to keep track of the following:

  1. Image size
  2. Object size
  3. Boundaries of the bounding boxes

@Arsakes commented Oct 8, 2017

Do you have any advice on the range of values that bounding boxes should stay within?
How small or big can they be?
I have rather small 40x40 objects in my dataset and am getting NaN/inf right at the start.

Curiously, when I apply batch normalization during training I get no NaN/inf, but a constant loss :(

batch_norm {
    train: true
}

@jerowe commented Nov 18, 2017

I am also having this problem. My images do not have any particular orientation, so I have flipped, flopped, transposed, transversed, and rotated them by 90, -90, and 180 degrees. Training works great on the original dataset, but not on the artificially augmented training set.

@Wenstery (Author) commented:

@jerowe, you can try fixing the bounding boxes as @ph463 did: "I double checked my annotations and found some bounding boxes outside the image boundaries."

@jerowe commented Nov 27, 2017

@Wenstery, I took another look over my dataset. I checked the annotations for boxes outside the image boundaries and didn't find any. I am using the faster_rcnn_inception_resnet_v2_atrous_coco model. I do have very small boxes.

Oddly enough, the flipped (horizontal), flopped (vertical), transverse (y=x), and 90/180/-90 rotated datasets all gave me the NaN error, but the transposed dataset was fine.

I noticed that the newest tensorflow models include preprocessing steps for vertical flip and rotation by 90 degrees; a horizontal-flip preprocessing step is already there. Just waiting for the TensorFlow 1.4 release on conda to try it out!
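If those preprocessing steps are available in your checkout of the models repo, enabling them in the pipeline config should look like the existing horizontal-flip option. The option names below are assumed from the newer preprocessor; verify them against your preprocessor.proto:

data_augmentation_options {
  random_vertical_flip {
  }
}
data_augmentation_options {
  random_rotation90 {
  }
}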

@angerson commented Dec 11, 2017

This issue appears to be resolved, although it looks like the discussion has to do with something that could be clarified in documentation or contributed as a QoL improvement -- if anyone feels strongly about this, please feel free to put together a PR or file a specific feature request. Thanks!

@kalanityL commented Mar 20, 2018

Same issue.

The dataset is checked and sound: no empty bounding boxes, no boxes outside the image.

The dataset works fine when fine-tuning ssd_mobilenet_v1, but crashes with exactly the same error reported here when fine-tuning faster_rcnn_inception_resnet_v2.

I removed "small" bounding boxes (width or height less than 20 px), and now it is OK, no more crashes.

It is a little annoying to have to remove those boxes for some of my classes... but at least the training can run.

Thanks all for the help
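For anyone wanting to apply the same filter, here is a minimal sketch that drops boxes with a side shorter than 20 px from VOC-style XML annotations. The directory path and tag names are assumptions, and it rewrites the files in place, so work on a copy:

import os
import xml.etree.ElementTree as ET

annotation_dir = "annotations"  # placeholder path; back up your files first
min_side = 20                   # minimum box width/height in pixels

for name in os.listdir(annotation_dir):
    if not name.endswith(".xml"):
        continue
    path = os.path.join(annotation_dir, name)
    tree = ET.parse(path)
    root = tree.getroot()
    removed = 0
    for obj in root.findall("object"):
        box = obj.find("bndbox")
        w = float(box.find("xmax").text) - float(box.find("xmin").text)
        h = float(box.find("ymax").text) - float(box.find("ymin").text)
        if w < min_side or h < min_side:
            root.remove(obj)  # drop the whole <object> entry for this small box
            removed += 1
    if removed:
        tree.write(path)
        print("%s: removed %d small box(es)" % (name, removed))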

@hustc12 commented May 5, 2018

After some investigation, I found that small samples were not the root cause of the crash for my dataset. (I even tried training with very small samples, such as 5x5 pixels; at least in the first 200 steps there was no crash.)
What I actually found was that some annotation files had the coordinates in the wrong order. The annotations record coordinates named x1, y1, x2, and y2, where x1 should be less than x2 and y1 less than y2. However, in my case some of the annotated samples had x1 > x2 or y1 > y2, which caused the crash. After I corrected the coordinate order, the crash was gone. Hope this information helps someone.
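A minimal sketch of that fix, applied while converting annotations (the function name here is made up purely for illustration):

# Normalize a possibly swapped box so that x1 <= x2 and y1 <= y2
# before writing it to the annotation file or tfrecord.
def normalize_box(x1, y1, x2, y2):
    xmin, xmax = sorted((x1, x2))
    ymin, ymax = sorted((y1, y2))
    return xmin, ymin, xmax, ymax

# Example: a box annotated with x1 > x2 comes back in the correct order.
print(normalize_box(320, 40, 120, 200))  # (120, 40, 320, 200)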

@airman00 (Contributor) commented Mar 1, 2019

For reference, here's what you need to verify with your annotations:

  1. xmin < xmax AND ymin < ymax
  2. xmax <= width AND ymax <= height
  3. xmin, xmax, ymin, ymax are all positive numbers
  4. boxArea >= 1% of imageArea (not sure of the exact threshold, but requiring boxes to cover at least 1% of the image worked for me)

Verify with your label_map.pbtxt file:

  1. Category names in this file must match annotations (don't do any extra or miss any)

Verify with your pipeline config:

  1. Correct number of classes - must match label_map.pbtxt file

Here is a Python validation script for the annotation files, if you want to use it. Just set the directory path and run it.
If you want to use the script's result in a larger bash script, use this:
ERROR=$(python tools/annotationsValidator.py 2>&1 >/dev/null)

import os, sys 
import xml.etree.ElementTree as ET
from collections import Counter

classes = []
classNameCountArray = []
directory = 'input_augmented/annotations'


# Get a list of all the files ending in .xml
files = os.listdir(directory)
print "Found ", len(files) , " annotations"

# Open a file
firstFile = files[0]

error = ""
# Check each annotation file
for filename_short in files:
    if (not filename_short.endswith(".xml") ):
        print "Skipping invalid XML file", filename_short
        
    else:
        filename = directory+"/"+filename_short
        tree = ET.parse(filename)
        size = tree.find('size')
        imageWidth = int(size.find('width').text)
        imageHeight = int(size.find('height').text)
        imageArea = imageWidth * imageHeight   

        if ( tree.find('folder').text != "images" ):
            tree.find('folder').text  = "images"
            print "Changing folder name to images"
            error = "ERROR"

        if (".JPG" in tree.find('filename').text ):
            print filename,"Error .jpg to JPG"
            error = "ERROR"

        name = tree.find('object').find('name').text
        if (not (name in classes) ):
            classes.append(name)
        classNameCountArray.append(name)


        boundingBox = tree.find('object').find('bndbox')
        xmin = int( boundingBox.find('xmin').text )
        ymin = int( boundingBox.find('ymin').text )
        xmax = int( boundingBox.find('xmax').text )
        ymax = int( boundingBox.find('ymax').text )


        boxWidth  = xmax - xmin
        boxHeight = ymax - ymin
        boxArea = boxWidth * boxHeight

        # make sure that box size is more than 1%
        if (boxArea < 0.01 * imageArea):
            print filename, "Too Small object"
            error = "ERROR"

        # Flag boxes where xmin > xmax or ymin > ymax
        if (xmin > xmax or ymin > ymax):
            print filename,"Invalid Min Max relationship",xmin,xmax,ymin,ymax
            error = "ERROR"

        # Make sure that xmax <= width and ymax <= height
        if (xmax > imageWidth or ymax > imageHeight):
            print filename,"Invalid Limits of Bounding Box",xmin,xmax,ymin,ymax
            error = "ERROR"

        # make sure that everything is positive numbers
        if (xmin <= 0 or xmax <= 0 or ymin <= 0 or ymax <= 0):
            print filename,"Bounding box is zero",xmin,xmax,ymin,ymax
            error = "ERROR"

print "Found  " + str( len(classes) ) + " Classes = ",classes
c = Counter(classNameCountArray)
print c

if (not error):
    print "SUCCESS!"
sys.exit(error)

@shamik111691 commented:

I have a question. I want to include background images in my training data. I have read that background images should not have any ground-truth boxes. How, then, do I specify the CSV file for background images?

@sepehrfard commented:

I ran the same code to check my data and everything passes; however, I can't get past 20 training steps before the CheckNumerics error comes up. Could this be because I'm using the legacy train file instead of the new model_main.py?

@behnamnkp commented Feb 9, 2020

If your training works for a few steps and then you get this "Tensor had NaN values" error, it has nothing to do with GPU memory; it most likely has to do with your dataset. Changing the number of classes can also cause problems. I trained my own dataset on top of Xception_65 with 6 classes and it worked pretty well. However, when I changed the annotations to two classes (building and background), I started to get this error. In hindsight that makes sense: in the process of reclassifying the training data, I ended up with some images that were entirely background (no buildings, all pixels with value 0).
So I removed those images from my train and val sets and rebuilt the tfrecord files. See if that works for you too.
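A sketch of that filtering step, assuming single-channel PNG label masks where 0 is the background value; the directory name and file layout are placeholders, and it requires Pillow and NumPy:

import os
import numpy as np
from PIL import Image

mask_dir = "SegmentationClass"  # placeholder: folder of single-channel PNG label masks

for name in sorted(os.listdir(mask_dir)):
    if not name.endswith(".png"):
        continue
    mask = np.array(Image.open(os.path.join(mask_dir, name)))
    if mask.max() == 0:  # every pixel is the background value
        print("%s contains only background pixels" % name)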

@aachoo commented Feb 12, 2020

I am getting the same issue with models/research/object_detection/legacy/train.py on TensorFlow 1.15.
I have looked through my data: all bounding boxes are within the image dimensions, and xmin, ymin, xmax and ymax are in the correct order.
All my images are 1000x877 and all my bounding boxes are 100x100.
I have tried increasing the batch size and decreasing the learning rate, but nothing has changed the outcome.

Just wondering if there is anything else I should try to fix this issue.
Would appreciate any help!
Thanks!

UPDATE:
I cloned the models repository again, deleted the old one, and now it works great.

Not sure why it worked with the new clone, though... Maybe some files were updated? Or do the files keep track of other training runs and get confused? Should I clone a fresh models repository each time I train something?

@sainisanjay commented:

ERROR=$(python tools/annotationsValidator.py 2>&1 >/dev/null)

@airman00, I ran your script for annotation validation and got the error below. May I know what it means?

Found  5395  annotations
Traceback (most recent call last):
  File "annotationvalidation.py", line 63, in <module>
    xmin = int( boundingBox.find('xmin').text )
ValueError: invalid literal for int() with base 10: '988.0'

I set the annotation .xml directory path in the script.

@aachoo commented Mar 19, 2020

@sainisanjay try float( ) instead of int( )
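That is, in the script above the coordinate parsing lines become the following, since some annotation tools write fractional pixel values like 988.0:

xmin = int(float(boundingBox.find('xmin').text))
ymin = int(float(boundingBox.find('ymin').text))
xmax = int(float(boundingBox.find('xmax').text))
ymax = int(float(boundingBox.find('ymax').text))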

@sainisanjay commented:

What would be the issue if an annotated bounding box coordinate is 0?
For example, my image size is [1820, 940] and the annotated object coordinates are
[0.0 0.16052623 473.51367 574.22736]

@JonBlanco11 commented:

For me the problem was in the data augmentation: delete the crop option. I left only the flip.

@pranatitrivedi8 commented:

(Quoting @airman00's annotation checklist and validation script from above.)

Hi @airman00, I got the error below. Could you help me understand why I am getting it?

......
Changing folder name to images
Traceback (most recent call last):
File "annotationsValidator.py", line 46, in
boundingBox = tree.find('object').find('bndbox')
AttributeError: 'NoneType' object has no attribute 'find'

@sainisanjay commented:

@pranatitrivedi8 It looks like some of your XML files don't have a bndbox object; that's why it gives the error AttributeError: 'NoneType' object has no attribute 'find'.
