Training: invalid value encountered in less + nan's #26

Closed · ghost opened this issue Jun 20, 2019 · 7 comments

@ghost commented Jun 20, 2019

python train.py --batch_size 8 --dataset=C:\...\platt.record --val_dataset=C:\...\platt_val.record --epochs 10 --mode eager_fit --transfer fine_tune --weights ./checkpoints/yolov3-tiny.tf --tiny

results in this output:

Epoch 1/10
2019-06-20 02:13:00.680170: I tensorflow/core/profiler/lib/profiler_session.cc:164] Profile Session started.
2019-06-20 02:13:00.685371: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library cupti64_100.dll
      1/Unknown - 4s 4s/step - loss: nan - yolo_output_0_loss: nan - yolo_output_1_loss: nanW0620 02:13:01.387073  9828 callbacks.py:236] Method (on_train_batch_end) is slow compared to the batch update (0.256449). Check your callbacks.
      7/Unknown - 6s 807ms/step - loss: nan - yolo_output_0_loss: nan - yolo_output_1_loss: nanC:\...\Anaconda3\envs\yolov3-tf2\lib\site-packages\tensorflow\python\keras\callbacks.py:1467: RuntimeWarning: invalid value encountered in less
  self.monitor_op = lambda a, b: np.less(a, b - self.min_delta)
C:\...\Anaconda3\envs\yolov3-tf2\lib\site-packages\tensorflow\python\keras\callbacks.py:979: RuntimeWarning: invalid value encountered in less
  if self.monitor_op(current - self.min_delta, self.best):

Epoch 00001: saving model to checkpoints/yolov3_train_1.tf
7/7 [==============================] - 7s 1s/step - loss: nan - yolo_output_0_loss: nan - yolo_output_1_loss: nan - val_loss: nan - val_yolo_output_0_loss: nan - val_yolo_output_1_loss: nan
Epoch 2/10
6/7 [========================>.....] - ETA: 0s - loss: nan - yolo_output_0_loss: nan - yolo_output_1_loss: nan
Epoch 00002: saving model to checkpoints/yolov3_train_2.tf
7/7 [==============================] - 3s 394ms/step - loss: nan - yolo_output_0_loss: nan - yolo_output_1_loss: nan - val_loss: nan - val_yolo_output_0_loss: nan - val_yolo_output_1_loss: nan
Epoch 3/10
6/7 [========================>.....] - ETA: 0s - loss: nan - yolo_output_0_loss: nan - yolo_output_1_loss: nan
Epoch 00003: saving model to checkpoints/yolov3_train_3.tf
7/7 [==============================] - 3s 396ms/step - loss: nan - yolo_output_0_loss: nan - yolo_output_1_loss: nan - val_loss: nan - val_yolo_output_0_loss: nan - val_yolo_output_1_loss: nan
Epoch 00003: early stopping

What might be the cause of this? Also, there are other open issues regarding training, and I'm wondering if anyone has been successful.
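For reference, the RuntimeWarning in the log comes from Keras' checkpoint/early-stopping callbacks comparing the monitored loss against the best value seen so far with np.less(); once the loss is NaN, that comparison raises the floating-point "invalid" flag and always returns False, so no "improvement" is ever detected. A minimal reproduction (whether the warning itself appears can depend on the NumPy version, but the False result is the important part):

```python
import numpy as np

# np.less(nan, x) sets the floating-point "invalid" flag and returns False,
# so ModelCheckpoint/EarlyStopping never see an improvement once loss is NaN.
with np.errstate(invalid="warn"):
    print(np.less(np.nan, 1.0))  # False (plus the RuntimeWarning on older NumPy)
```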

@dgarkov commented Jun 30, 2019

I can confirm the issue. I have tested it on two different setups, one with the tiny model and one without. Although I didn't get NaNs right from the start, both ended up with the above outcome.

@zzh8829 (Owner) commented Aug 2, 2019

Can you paste some of your sample data here? It's hard to tell without the training data.

@ghost (Author) commented Aug 19, 2019

> Can you paste some of your sample data here? It's hard to tell without the training data.

Unfortunately it was a long time ago and I have since switched; I'm not using this project anymore. Thank you for the reply :)

@ghost ghost closed this as completed Aug 19, 2019
@samratkokula

I am seeing a similar error. My sample data looks like the below; this is before converting it to a TFRecord:

img1.jpeg 0.2901965,0.492121,0.4980395,0.576363,0 0.500981,0.495151,0.701961,0.573333,0 0.6464709999999999,0.5275755,0.696079,0.5809085,1
img2.jpeg 0.259094,0.4052765,0.416548,0.49675549999999996,0 0.417618,0.403979,0.5686519999999999,0.500649,0

Can you please help?
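In case it helps anyone arriving here: the annotation lines above are space-separated boxes, each with comma-separated normalized coordinates and a class id, and a NaN or out-of-range box at this stage will propagate straight into the loss. A minimal sketch for scanning such a file before the TFRecord conversion; the file name "annotations.txt" and the assumption that coordinates lie in [0, 1] are illustrative, not taken from this repo:

```python
import math

# Sanity-check annotation lines of the form:
#   image.jpeg x1,y1,x2,y2,class x1,y1,x2,y2,class ...
def check_annotations(path="annotations.txt"):
    with open(path) as f:
        for line_no, line in enumerate(f, start=1):
            parts = line.split()
            if not parts:
                continue
            image, boxes = parts[0], parts[1:]
            for box in boxes:
                x1, y1, x2, y2, cls = (float(v) for v in box.split(","))
                if any(math.isnan(v) for v in (x1, y1, x2, y2)):
                    print(f"line {line_no} ({image}): NaN coordinate in {box}")
                elif not (0.0 <= x1 < x2 <= 1.0 and 0.0 <= y1 < y2 <= 1.0):
                    print(f"line {line_no} ({image}): suspicious box {box}")

check_annotations()
```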

@AnaRhisT94

Hi @samratkokula,
have you managed to fix the issue?

@IlkayW commented Feb 6, 2020

I'm encountering the same issue.
It seems to occur randomly ...
Is there a quick way to fix it?
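There is no quick fix for the root cause (usually bad labels or an unstable learning rate), but here is a sketch of how one might fail fast instead of training through NaNs, using only standard Keras pieces rather than this repo's train.py; the learning-rate and clipnorm values are arbitrary examples:

```python
import tensorflow as tf

# Stop training at the first NaN loss instead of saving NaN checkpoints.
callbacks = [
    tf.keras.callbacks.TerminateOnNaN(),
    tf.keras.callbacks.ModelCheckpoint(
        "checkpoints/yolov3_train_{epoch}.tf",
        verbose=1, save_weights_only=True),
]

# A smaller learning rate and gradient clipping are common mitigations when
# the data itself is clean; both values here are illustrative only.
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4, clipnorm=1.0)

# model.compile(optimizer=optimizer, loss=loss)
# model.fit(train_dataset, epochs=10, callbacks=callbacks,
#           validation_data=val_dataset)
```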

@julio-ruepp

I had the same failure; the problem was that there were some NaNs in my data ;)
