LossTensor is nan while training any model on Caltech dataset #1907

Closed
yossibiton opened this issue Jul 10, 2017 · 14 comments

@yossibiton

yossibiton commented Jul 10, 2017

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): used the official train script
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 16.04
  • TensorFlow installed from (source or binary): binary, using "pip install tensorflow-gpu"
  • TensorFlow version (use command below): 1.2.0
  • CUDA/cuDNN version: CUDA 8.0, cuDNN 5.1
  • GPU model and memory: GTX 1070 (8 GB)
  • Exact command to reproduce:
    python train.py --logtostderr --pipeline_config_path="ssd_mobilenet_v1_caltech-nodifficult.config" --train_dir="train"

The problem

After successfully training a model on the Pet dataset, I moved on and tried to train a pedestrian detection model on the Caltech dataset.
However, the training fails with "LossTensor is inf or nan", no matter what model or parameters I'm using. This is the error message:
2017-07-10 15:16:44.245498: W tensorflow/core/framework/op_kernel.cc:1158] Invalid argument: LossTensor is inf or nan. : Tensor had NaN values

For some reason TensorFlow fails to process some samples in the dataset, although the images and annotations are totally fine.
Below I have attached a small part of the dataset that reproduces the error; I can also find many other images in the dataset that reproduce the same error.

Source code / logs

I have shared a Drive folder with the following files:

  1. caltech_train_no-difficult.record: the dataset file. I put only 3 annotated images here (taken from Caltech); the train script fails on the first batch.
  2. caltech_label_map.pbtxt: defines the 0/1 labels.
  3. ssd_mobilenet_v1_caltech-nodifficult.config: the main config file (change PATH_TO_BE_CONFIGURED to the folder path where you downloaded the 2 other files).

https://drive.google.com/drive/folders/0B_FKANmkiMlxY0RxVWZiVE1KX00?usp=sharing

This is the first image (size 480x640) in the attached dataset file, with the annotations drawn on it.
Annotations (normalized):
xmin = [0.5734, 0.6312, 0.6218, 0.3531]
xmax = [0.5906, 0.6516, 0.6359, 0.375]
ymin = [0.3375, 0.3437, 0.3396, 0.3458]
ymax = [0.4146, 0.4146, 0.4125, 0.4312]

[image: caltech_sample]
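
For reference, here is a minimal sanity check over these normalized boxes: a sketch, assuming 480 is the image height and 640 the width, with the values copied from the annotations above; none of this comes from the attached scripts.

```python
# Sanity-check the normalized boxes of this sample: every coordinate must lie
# in [0, 1], and xmin < xmax, ymin < ymax. Image size assumed here: height=480,
# width=640 (from the comment above).
xmin = [0.5734, 0.6312, 0.6218, 0.3531]
xmax = [0.5906, 0.6516, 0.6359, 0.375]
ymin = [0.3375, 0.3437, 0.3396, 0.3458]
ymax = [0.4146, 0.4146, 0.4125, 0.4312]
img_h, img_w = 480, 640

for i, (x0, x1, y0, y1) in enumerate(zip(xmin, xmax, ymin, ymax)):
    assert 0.0 <= x0 <= 1.0 and 0.0 <= x1 <= 1.0, 'box %d: x out of [0, 1]' % i
    assert 0.0 <= y0 <= 1.0 and 0.0 <= y1 <= 1.0, 'box %d: y out of [0, 1]' % i
    assert x0 < x1 and y0 < y1, 'box %d: swapped or degenerate corners' % i
    print('box %d: %.0f x %.0f px' % (i, (x1 - x0) * img_w, (y1 - y0) * img_h))
```

These boxes pass both checks and come out at roughly 9-14 px wide and 34-41 px tall.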

@jch1
Contributor

jch1 commented Jul 11, 2017

@yossibiton I notice in your config file that you've set batch_size to 1 and the learning rate to zero. Is that a possible culprit? In our released configs, I believe we have a batch size of 24 (and, of course, non-zero learning rates).

@yossibiton
Author

yossibiton commented Jul 11, 2017

Hi @jch1,
I chose a zero learning rate just for debugging (to eliminate other possible causes of the NaN). The problem persists for different values of learning rate and batch size.

I would be thankful if you could run the train script on your environment and see if it fails there.

@yossibiton
Author

A possible reason may be some small objects (15x30 pixels) that don't fit any of the anchors generated by SSD.
However, I can't understand why it should crash TensorFlow so dramatically.
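
To make the anchor hypothesis concrete, here is an illustrative sketch (not taken from the attached config) that flags boxes whose normalized size is far below an assumed SSD minimum anchor scale of 0.2, a typical ssd_anchor_generator value; such boxes may never be matched to any default box.

```python
# Illustrative sketch: flag ground-truth boxes that are much smaller than the
# smallest SSD anchor. min_scale=0.2 is an assumed/typical ssd_anchor_generator
# value, not read from the config attached to this issue.
MIN_ANCHOR_SCALE = 0.2

def flag_small_boxes(boxes, min_scale=MIN_ANCHOR_SCALE):
    """boxes: list of (ymin, xmin, ymax, xmax) in normalized coordinates."""
    flagged = []
    for i, (y0, x0, y1, x1) in enumerate(boxes):
        # Geometric mean of height and width as a rough box "scale".
        scale = ((y1 - y0) * (x1 - x0)) ** 0.5
        if scale < 0.5 * min_scale:
            flagged.append((i, round(scale, 3)))
    return flagged

# The first two boxes from the sample above (~11x37 px and ~13x34 px):
boxes = [(0.3375, 0.5734, 0.4146, 0.5906),
         (0.3437, 0.6312, 0.4146, 0.6516)]
print(flag_small_boxes(boxes))  # both are well below the assumed minimum scale
```

Whether an unmatched ground-truth box can actually drive the loss to NaN is a separate question; this only quantifies how far below the anchor range these objects sit.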

@jart
Contributor

jart commented Jul 13, 2017

Thanks for reaching out, @yossibiton, but this issue tracker is for bugs and feature requests. Consider asking on StackOverflow, since there is a larger community that reads questions there.

@jart jart closed this as completed Jul 13, 2017
@yossibiton
Author

@jart
I posted here since I do believe this is a bug, and the issue shouldn't be closed immediately.
Feeding a clean dataset into the system shouldn't end with such a crash.
I also provided all the necessary information for reproducing this crash.

@jart
Contributor

jart commented Jul 14, 2017

NaNs can happen for a variety of reasons. It would be helpful to see more tracebacks and logs. The Caltech dataset isn't included as an example in the models repository and your Drive folder appears to have configurations you've written yourself. There may be a bug, but it's hard to tell by reading what the bug is. If you can help us identify the bug, then we're absolutely interested in solving it. Just please understand that we don't have the resources to provide support on using these models. That's what StackOverflow is for.

@Samin100

@yossibiton Were you able to identify what the culprit for the NaN/inf error was? I'm getting the same one, and I hadn't even thought that it might be due to having small objects in my training set.

@yossibiton
Author

yossibiton commented Jul 25, 2017 via email

@Samin100

That's really weird; I have quite a few small objects in my dataset. I've been trying to figure out what was wrong for a while now. Do you remember what dimensions you chose for the smallest-object cutoff?

@yossibiton
Author

The smallest object I have used has a height of about 15% of the image height.
But I can't be sure that small objects are the problem here, so don't take this number too seriously.

@deepankverma

My training doesn't even start and throws "LossTensor is inf or nan. : Tensor had NaN values". Now I know it's because I have very small objects in the dataset, mostly 15 x 30 px. I will try to subdivide the images so that the objects scale up.
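
A rough sketch of that subdividing idea, assuming images as NumPy-style HxWxC arrays and boxes in normalized (ymin, xmin, ymax, xmax) format; the function and its behavior are illustrative, not part of the object detection API.

```python
def tile_image(image, boxes, rows=2, cols=2):
    """Split an image into a rows x cols grid and remap its normalized boxes.

    image: HxWxC array; boxes: list of (ymin, xmin, ymax, xmax) in [0, 1].
    Returns (tile, tile_boxes) pairs; a box is assigned to the tile containing
    its center, re-normalized to that tile and clipped to [0, 1].
    """
    h, w = image.shape[:2]
    tiles = []
    for r in range(rows):
        for c in range(cols):
            y0, y1 = r / float(rows), (r + 1) / float(rows)
            x0, x1 = c / float(cols), (c + 1) / float(cols)
            tile = image[int(y0 * h):int(y1 * h), int(x0 * w):int(x1 * w)]
            tile_boxes = []
            for by0, bx0, by1, bx1 in boxes:
                cy, cx = (by0 + by1) / 2.0, (bx0 + bx1) / 2.0
                if not (y0 <= cy < y1 and x0 <= cx < x1):
                    continue  # this object's center lies in another tile
                tile_boxes.append((max((by0 - y0) * rows, 0.0),
                                   max((bx0 - x0) * cols, 0.0),
                                   min((by1 - y0) * rows, 1.0),
                                   min((bx1 - x0) * cols, 1.0)))
            if tile_boxes:
                tiles.append((tile, tile_boxes))
    return tiles
```

With a 2x2 grid, an object that was 15 x 30 px in the original frame covers twice as large a fraction of each training image.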

@andreabc

I had the same error. After removing small objects (less than 15% of width/height) and making sure the normalized bounding boxes were between 0 and 1, I haven't had any problems training so far.
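
A minimal sketch of that clean-up step, assuming normalized (ymin, xmin, ymax, xmax) boxes; the 15% cutoff mirrors the number mentioned above, and applying it to both dimensions is a guess, so adjust the rule to your data.

```python
def clean_boxes(boxes, min_rel_size=0.15):
    """Clip normalized boxes to [0, 1] and drop ones below the size cutoff.

    boxes: iterable of (ymin, xmin, ymax, xmax) in normalized coordinates.
    min_rel_size: minimum box height and width as a fraction of the image.
    """
    kept = []
    for y0, x0, y1, x1 in boxes:
        y0, x0 = max(y0, 0.0), max(x0, 0.0)
        y1, x1 = min(y1, 1.0), min(x1, 1.0)
        if (y1 - y0) >= min_rel_size and (x1 - x0) >= min_rel_size:
            kept.append((y0, x0, y1, x1))
    return kept

# Example: the second box is only ~2% of the image wide and gets dropped.
print(clean_boxes([(0.1, 0.1, 0.6, 0.4), (0.34, 0.63, 0.41, 0.65)]))
```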

@PythonImageDeveloper

PythonImageDeveloper commented Mar 5, 2018

Hi @yossibiton,
did you solve your problem? How did you convert the Caltech dataset to a record file? Please give me a reference or step-by-step instructions for this.
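
Since the conversion itself isn't shown anywhere in this thread, here is a hedged sketch of writing a single example in the usual TF Object Detection API TFRecord layout (TF 1.x API, as used above); the feature keys follow the dataset tools in the models repo, while the image path, class name and box values are purely illustrative.

```python
import tensorflow as tf  # TF 1.x, as used in this thread

def _bytes(values):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=values))

def _floats(values):
    return tf.train.Feature(float_list=tf.train.FloatList(value=values))

def _ints(values):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=values))

with tf.gfile.GFile('example.jpg', 'rb') as f:  # illustrative image path
    encoded_jpg = f.read()

# One image with one normalized box; 'person' / label 1 must match the label map.
example = tf.train.Example(features=tf.train.Features(feature={
    'image/height': _ints([480]),
    'image/width': _ints([640]),
    'image/encoded': _bytes([encoded_jpg]),
    'image/format': _bytes([b'jpeg']),
    'image/object/bbox/xmin': _floats([0.5734]),
    'image/object/bbox/xmax': _floats([0.5906]),
    'image/object/bbox/ymin': _floats([0.3375]),
    'image/object/bbox/ymax': _floats([0.4146]),
    'image/object/class/text': _bytes([b'person']),
    'image/object/class/label': _ints([1]),
}))

writer = tf.python_io.TFRecordWriter('caltech_train.record')
writer.write(example.SerializeToString())
writer.close()
```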

@hustc12

hustc12 commented May 4, 2018

I removed some of the samples in my dataset (those whose size is less than 15% of the width and height), and it seems that the issue is gone.
UPDATE: After some investigation, I found that small samples are not the root cause of the crash. (I even tried training with very small samples, such as 5x5 pixels; at least in the first 200 steps there was no crash.)
What I actually found, in the annotation file, was a wrong order of the coordinates. For instance, the annotations mark the coordinates x1, y1, x2 and y2, where x1 should be less than x2, and likewise y1 less than y2. However, in my case some of the annotated samples had x1 > x2 or y1 > y2, which caused the crash. After I corrected the order of the coordinates, the crash was gone. Hope this information can help someone.
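
A small sketch of that fix, to be applied to the raw annotations before creating the record file; the function name and the example values are illustrative.

```python
def fix_box_order(x1, y1, x2, y2):
    """Reorder corner coordinates so that x1 < x2 and y1 < y2.

    Annotations that store the corners in the wrong order imply negative
    widths/heights downstream, which can surface during training as
    'LossTensor is inf or nan'.
    """
    x1, x2 = min(x1, x2), max(x1, x2)
    y1, y2 = min(y1, y2), max(y1, y2)
    return x1, y1, x2, y2

# Example: a swapped annotation (x1 > x2) gets corrected.
print(fix_box_order(120, 80, 95, 140))  # -> (95, 80, 120, 140)
```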
