LossTensor is nan while training any model on Caltech dataset #1907
Comments
@yossibiton I notice in your config file that you've set batch_size to 1 and the learning rate to zero. Could that be the culprit? In our released configs, I believe we have a batch size of 24 (and, of course, non-zero learning rates).
Hi @jch1, I would be thankful if you could run the train script in your environment and see if it fails there.
A possible reason may be some small objects (15x30 pixels) that don't fit any of the anchors generated by SSD.
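As a rough sanity check of that theory, here is a small sketch (my own numbers, assuming the min_scale of 0.2 used in typical released SSD configs) comparing a 15x30 px object in a 480x640 image against the smallest anchor scale:

```python
# Hypothetical check: is a 15x30 px object smaller than the smallest SSD anchor?
img_h, img_w = 480, 640
obj_w, obj_h = 15, 30  # object size in pixels

rel_w = obj_w / img_w  # relative width  ~0.023
rel_h = obj_h / img_h  # relative height ~0.0625

min_scale = 0.2  # assumed smallest anchor scale (ssd_anchor_generator min_scale)

# Both dimensions are well below the smallest anchor scale, so no anchor
# is likely to match the box with a reasonable IoU.
print(rel_w < min_scale and rel_h < min_scale)
```

If no anchor matches a ground-truth box, the target assignment can degenerate and the loss can blow up, which is consistent with the NaN reports below.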
Thanks for reaching out @yossibiton, but this issue tracker is for bugs and feature requests. Consider reaching out to StackOverflow, since there is a larger community that reads questions there.
@jart
NaNs can happen for a variety of reasons. It would be helpful to see more tracebacks and logs. The Caltech dataset isn't included as an example in the models repository, and your Drive folder appears to have configurations you've written yourself. There may be a bug, but it's hard to tell by reading what the bug is. If you can help us identify the bug, then we're absolutely interested in solving it. Just please understand that we don't have the resources to provide support on using these models. That's what StackOverflow is for.
@yossibiton Were you able to identify the culprit for the NaN/inf error? I'm getting the same one, and I hadn't even thought that it may be due to having small objects in my training set.
After deleting small objects from my dataset, the code stopped crashing.
I know it doesn't make any sense.
That's really weird; I have quite a few small objects in my dataset. I've been trying to figure out what was wrong for a while now. Do you remember what dimensions you chose for the smallest-object cutoff?
The smallest object I have used has a height of about 15% of the image height.
My training doesn't even start and throws
I had the same error. After removing small objects (less than 15% of width/height) and making sure the normalized bounding boxes were between 0 and 1, I haven't had any problems training so far.
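A minimal sketch of the pre-filtering described in the comments above (a hypothetical helper, not part of the Object Detection API): drop boxes with invalid normalized coordinates and boxes smaller than a relative-size cutoff before writing the TFRecords.

```python
def filter_boxes(boxes, min_rel_size=0.15):
    """Keep normalized (ymin, xmin, ymax, xmax) boxes that are valid
    (inside [0, 1], min < max) and at least min_rel_size of the image
    in both height and width."""
    kept = []
    for ymin, xmin, ymax, xmax in boxes:
        # Discard boxes with coordinates outside the normalized range.
        if not (0.0 <= ymin < ymax <= 1.0 and 0.0 <= xmin < xmax <= 1.0):
            continue
        # Discard boxes smaller than the relative-size cutoff.
        if (ymax - ymin) < min_rel_size or (xmax - xmin) < min_rel_size:
            continue
        kept.append((ymin, xmin, ymax, xmax))
    return kept
```

The 15% cutoff is just the value reported in this thread; the right threshold depends on the anchor configuration of the model being trained.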
Hi @yossibiton,
I removed some of the samples in my dataset (those whose size is less than 15% of the width and height), and it seems that the issue is gone.
System information
python train.py --logtostderr --pipeline_config_path="ssd_mobilenet_v1_caltech-nodifficult.config" --train_dir="train"
The problem
After successfully training a model on the Pet dataset, I moved on and tried to train a pedestrian detection model on Caltech.
However, the training fails with "LossTensor is inf or nan", no matter what model or parameters I'm using. This is the error message:
2017-07-10 15:16:44.245498: W tensorflow/core/framework/op_kernel.cc:1158] Invalid argument: LossTensor is inf or nan. : Tensor had NaN values
For some reason TensorFlow fails to process some samples in the dataset, although the images and annotations are totally fine.
Below I have attached a small part of the dataset that reproduces the error, although I can find many other images in the dataset that reproduce the same error.
Source code / logs
I have shared a Drive folder with the following files:
https://drive.google.com/drive/folders/0B_FKANmkiMlxY0RxVWZiVE1KX00?usp=sharing
This is the first image (size 480x640) in the attached dataset file, with the annotations drawn on it.
Annotations (normalized):
xmin = [0.5734, 0.6312, 0.6218, 0.3531]
xmax = [0.5906, 0.6516, 0.6359, 0.375]
ymin = [0.3375, 0.3437, 0.3396, 0.3458]
ymax = [0.4146, 0.4146, 0.4125, 0.4312]
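Computing the relative box sizes from the annotations above illustrates how small these objects are (the widths are only 1.4-2.2% of the image, the heights 7-8.5%):

```python
xmin = [0.5734, 0.6312, 0.6218, 0.3531]
xmax = [0.5906, 0.6516, 0.6359, 0.375]
ymin = [0.3375, 0.3437, 0.3396, 0.3458]
ymax = [0.4146, 0.4146, 0.4125, 0.4312]

# Relative width/height of each box (coordinates are already normalized).
widths  = [round(b - a, 4) for a, b in zip(xmin, xmax)]
heights = [round(b - a, 4) for a, b in zip(ymin, ymax)]

print(widths)   # widths between ~1.4% and ~2.2% of image width
print(heights)  # heights between ~7% and ~8.5% of image height
```

Objects this small are consistent with the "small objects don't match any SSD anchor" explanation discussed in the comments.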