LossTensor is nan while training any model on Caltech dataset #1907

Closed
yossibiton opened this issue Jul 10, 2017 · 14 comments

@yossibiton

yossibiton commented Jul 10, 2017

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): used the official train script
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 16.04
  • TensorFlow installed from (source or binary): binary, using "pip install tensorflow-gpu"
  • TensorFlow version (use command below): 1.2.0
  • CUDA/cuDNN version: CUDA 8.0, cuDNN 5.1
  • GPU model and memory: GTX 1070 (8 GB)
  • Exact command to reproduce:
    python train.py --logtostderr --pipeline_config_path="ssd_mobilenet_v1_caltech-nodifficult.config" --train_dir="train"

The problem

After successfully training a model on the Pet dataset, I moved on and tried to train a pedestrian detection model on the Caltech dataset.
However, the training fails with "LossTensor is inf or nan", no matter what model or parameters I'm using. This is the error message:
2017-07-10 15:16:44.245498: W tensorflow/core/framework/op_kernel.cc:1158] Invalid argument: LossTensor is inf or nan. : Tensor had NaN values

For some reason TensorFlow fails to process some samples in the dataset, although the images and annotations are totally fine.
Below I have attached a small part of the dataset that reproduces the error; I can also find many other images in the dataset that reproduce the same error.

Source code / logs

I have shared a Drive folder with the following files:

  1. caltech_train_no-difficult.record: the dataset file. I put only 3 annotated images here (taken from Caltech); the train script fails on the first batch.
  2. caltech_label_map.pbtxt: defines the 0/1 labels.
  3. ssd_mobilenet_v1_caltech-nodifficult.config: the main config file (change PATH_TO_BE_CONFIGURED to the folder path where you downloaded the 2 other files).

https://drive.google.com/drive/folders/0B_FKANmkiMlxY0RxVWZiVE1KX00?usp=sharing

This is the first image (size 480x640) in the attached dataset file, with the annotations drawn on it.
Annotations (normalized):
xmin = [0.5734, 0.6312, 0.6218, 0.3531]
xmax = [0.5906, 0.6516, 0.6359, 0.375]
ymin = [0.3375, 0.3437, 0.3396, 0.3458]
ymax = [0.4146, 0.4146, 0.4125, 0.4312]

[image: caltech_sample]
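
For reference, here is a minimal sanity check over these normalized boxes: a sketch, assuming 480 is the image height and 640 the width, with the values copied from the annotations above; none of this comes from the attached scripts.

```python
# Sanity-check the normalized boxes of this sample: every coordinate must lie
# in [0, 1], and xmin < xmax, ymin < ymax. Image size assumed here: height=480,
# width=640 (from the comment above).
xmin = [0.5734, 0.6312, 0.6218, 0.3531]
xmax = [0.5906, 0.6516, 0.6359, 0.375]
ymin = [0.3375, 0.3437, 0.3396, 0.3458]
ymax = [0.4146, 0.4146, 0.4125, 0.4312]
img_h, img_w = 480, 640

for i, (x0, x1, y0, y1) in enumerate(zip(xmin, xmax, ymin, ymax)):
    assert 0.0 <= x0 <= 1.0 and 0.0 <= x1 <= 1.0, 'box %d: x out of [0, 1]' % i
    assert 0.0 <= y0 <= 1.0 and 0.0 <= y1 <= 1.0, 'box %d: y out of [0, 1]' % i
    assert x0 < x1 and y0 < y1, 'box %d: swapped or degenerate corners' % i
    print('box %d: %.0f x %.0f px' % (i, (x1 - x0) * img_w, (y1 - y0) * img_h))
```

These boxes pass both checks and come out at roughly 9-14 px wide and 34-41 px tall.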

@jch1
Contributor

jch1 commented Jul 11, 2017

@yossibiton I notice in your config file that you've set batch_size to 1 and the learning rate to zero. Is that a possible culprit? In our released configs, I believe we have a batch size of 24 (and, of course, non-zero learning rates).

@yossibiton
Author

yossibiton commented Jul 11, 2017

Hi @jch1,
I chose a zero learning rate just for debugging (to eliminate other possible causes of the NaN). The problem persists for different values of learning rate and batch size.

I would be thankful if you could run the train script on your environment and see if it fails there.

@yossibiton
Author

A possible reason may be some small objects (15x30 pixels) that don't fit any of the anchors generated by SSD.
However, I can't understand why it should crash TensorFlow so dramatically.
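
To make the anchor hypothesis concrete, here is an illustrative sketch (not taken from the attached config) that flags boxes whose normalized size is far below an assumed SSD minimum anchor scale of 0.2, a typical ssd_anchor_generator value; such boxes may never be matched to any default box.

```python
# Illustrative sketch: flag ground-truth boxes that are much smaller than the
# smallest SSD anchor. min_scale=0.2 is an assumed/typical ssd_anchor_generator
# value, not read from the config attached to this issue.
MIN_ANCHOR_SCALE = 0.2

def flag_small_boxes(boxes, min_scale=MIN_ANCHOR_SCALE):
    """boxes: list of (ymin, xmin, ymax, xmax) in normalized coordinates."""
    flagged = []
    for i, (y0, x0, y1, x1) in enumerate(boxes):
        # Geometric mean of height and width as a rough box "scale".
        scale = ((y1 - y0) * (x1 - x0)) ** 0.5
        if scale < 0.5 * min_scale:
            flagged.append((i, round(scale, 3)))
    return flagged

# The first two boxes from the sample above (~11x37 px and ~13x34 px):
boxes = [(0.3375, 0.5734, 0.4146, 0.5906),
         (0.3437, 0.6312, 0.4146, 0.6516)]
print(flag_small_boxes(boxes))  # both are well below the assumed minimum scale
```

Whether an unmatched ground-truth box can actually drive the loss to NaN is a separate question; this only quantifies how far below the anchor range these objects sit.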

@jart
Contributor

jart commented Jul 13, 2017

Thanks for reaching out, @yossibiton, but this issue tracker is for bugs and feature requests. Consider asking on StackOverflow, since there is a larger community that reads questions there.

@jart jart closed this as completed Jul 13, 2017
@yossibiton
Author

@jart
I posted here since I do believe this is a bug, and the issue shouldn't be closed immediately.
Feeding a clean dataset into the system shouldn't end with such a crash.
I also provided all the necessary information for reproducing this crash.

@jart
Contributor

jart commented Jul 14, 2017

NaNs can happen for a variety of reasons. It would be helpful to see more tracebacks and logs. The Caltech dataset isn't included as an example in the models repository and your Drive folder appears to have configurations you've written yourself. There may be a bug, but it's hard to tell by reading what the bug is. If you can help us identify the bug, then we're absolutely interested in solving it. Just please understand that we don't have the resources to provide support on using these models. That's what StackOverflow is for.

@Samin100

@yossibiton Were you able to identify what the culprit for the NaN/inf error was? I'm getting the same one, and I hadn't even thought that it might be due to having small objects in my training set.

@yossibiton
Author

yossibiton commented Jul 25, 2017 via email

@Samin100

That's really weird; I have quite a few small objects in my dataset. I've been trying to figure out what was wrong for a while now. Do you remember what dimensions you chose for the smallest-object cutoff?

@yossibiton
Author

The smallest object I have used has a height of about 15% of the image height.
But I can't be sure that small objects are the problem here, so don't take this number too seriously.

@deepankverma

My training doesn't even start and throws "LossTensor is inf or nan. : Tensor had NaN values". Now I know it's because I have very small objects in the dataset, mostly 15 x 30 px. I will try to subdivide the images so that the objects scale up.
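
A rough sketch of that subdividing idea, assuming images as NumPy-style HxWxC arrays and boxes in normalized (ymin, xmin, ymax, xmax) format; the function and its behavior are illustrative, not part of the object detection API.

```python
def tile_image(image, boxes, rows=2, cols=2):
    """Split an image into a rows x cols grid and remap its normalized boxes.

    image: HxWxC array; boxes: list of (ymin, xmin, ymax, xmax) in [0, 1].
    Returns (tile, tile_boxes) pairs; a box is assigned to the tile containing
    its center, re-normalized to that tile and clipped to [0, 1].
    """
    h, w = image.shape[:2]
    tiles = []
    for r in range(rows):
        for c in range(cols):
            y0, y1 = r / float(rows), (r + 1) / float(rows)
            x0, x1 = c / float(cols), (c + 1) / float(cols)
            tile = image[int(y0 * h):int(y1 * h), int(x0 * w):int(x1 * w)]
            tile_boxes = []
            for by0, bx0, by1, bx1 in boxes:
                cy, cx = (by0 + by1) / 2.0, (bx0 + bx1) / 2.0
                if not (y0 <= cy < y1 and x0 <= cx < x1):
                    continue  # this object's center lies in another tile
                tile_boxes.append((max((by0 - y0) * rows, 0.0),
                                   max((bx0 - x0) * cols, 0.0),
                                   min((by1 - y0) * rows, 1.0),
                                   min((bx1 - x0) * cols, 1.0)))
            if tile_boxes:
                tiles.append((tile, tile_boxes))
    return tiles
```

With a 2x2 grid, an object that was 15 x 30 px in the original frame covers twice as large a fraction of each training image.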

@andreabc

I had the same error. After removing small objects (less than 15% of width/height) and making sure the normalized bounding boxes were between 0 and 1, I haven't had any problems training so far.
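
A minimal sketch of that clean-up step, assuming normalized (ymin, xmin, ymax, xmax) boxes; the 15% cutoff mirrors the number mentioned above, and applying it to both dimensions is a guess, so adjust the rule to your data.

```python
def clean_boxes(boxes, min_rel_size=0.15):
    """Clip normalized boxes to [0, 1] and drop ones below the size cutoff.

    boxes: iterable of (ymin, xmin, ymax, xmax) in normalized coordinates.
    min_rel_size: minimum box height and width as a fraction of the image.
    """
    kept = []
    for y0, x0, y1, x1 in boxes:
        y0, x0 = max(y0, 0.0), max(x0, 0.0)
        y1, x1 = min(y1, 1.0), min(x1, 1.0)
        if (y1 - y0) >= min_rel_size and (x1 - x0) >= min_rel_size:
            kept.append((y0, x0, y1, x1))
    return kept

# Example: the second box is only ~2% of the image wide and gets dropped.
print(clean_boxes([(0.1, 0.1, 0.6, 0.4), (0.34, 0.63, 0.41, 0.65)]))
```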

@PythonImageDeveloper

PythonImageDeveloper commented Mar 5, 2018

Hi @yossibiton,
did you solve your problem? How did you convert the Caltech dataset to a record file? Please give me a reference or step-by-step instructions for this.
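
Since the conversion itself isn't shown anywhere in this thread, here is a hedged sketch of writing a single example in the usual TF Object Detection API TFRecord layout (TF 1.x API, as used above); the feature keys follow the dataset tools in the models repo, while the image path, class name and box values are purely illustrative.

```python
import tensorflow as tf  # TF 1.x, as used in this thread

def _bytes(values):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=values))

def _floats(values):
    return tf.train.Feature(float_list=tf.train.FloatList(value=values))

def _ints(values):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=values))

with tf.gfile.GFile('example.jpg', 'rb') as f:  # illustrative image path
    encoded_jpg = f.read()

# One image with one normalized box; 'person' / label 1 must match the label map.
example = tf.train.Example(features=tf.train.Features(feature={
    'image/height': _ints([480]),
    'image/width': _ints([640]),
    'image/encoded': _bytes([encoded_jpg]),
    'image/format': _bytes([b'jpeg']),
    'image/object/bbox/xmin': _floats([0.5734]),
    'image/object/bbox/xmax': _floats([0.5906]),
    'image/object/bbox/ymin': _floats([0.3375]),
    'image/object/bbox/ymax': _floats([0.4146]),
    'image/object/class/text': _bytes([b'person']),
    'image/object/class/label': _ints([1]),
}))

writer = tf.python_io.TFRecordWriter('caltech_train.record')
writer.write(example.SerializeToString())
writer.close()
```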

@hustc12

hustc12 commented May 4, 2018

I removed some of the samples in my dataset (those whose size is less than 15% of the width and height), and it seems that the issue is gone.
UPDATE: After some investigation, I found that small samples are not the root cause of the crash. (I even tried training with very small samples, such as 5x5 pixels; at least in the first 200 steps there was no crash.)
What I actually found, in the annotation file, was a wrong order of the coordinates. For instance, the annotations mark the coordinates x1, y1, x2 and y2, where x1 should be less than x2, and likewise y1 less than y2. However, in my case some of the annotated samples had x1 > x2 or y1 > y2, which caused the crash. After I corrected the order of the coordinates, the crash was gone. Hope this information can help someone.
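
A small sketch of that fix, to be applied to the raw annotations before creating the record file; the function name and the example values are illustrative.

```python
def fix_box_order(x1, y1, x2, y2):
    """Reorder corner coordinates so that x1 < x2 and y1 < y2.

    Annotations that store the corners in the wrong order imply negative
    widths/heights downstream, which can surface during training as
    'LossTensor is inf or nan'.
    """
    x1, x2 = min(x1, x2), max(x1, x2)
    y1, y2 = min(y1, y2), max(y1, y2)
    return x1, y1, x2, y2

# Example: a swapped annotation (x1 > x2) gets corrected.
print(fix_box_order(120, 80, 95, 140))  # -> (95, 80, 120, 140)
```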
