Questions about transfer learning and training loss = nan #185

Closed
jackyvr opened this issue Feb 21, 2020 · 11 comments

@jackyvr

jackyvr commented Feb 21, 2020

Thanks for the code. I am doing transfer learning with the yolov3 tf2 model using my own dataset (only one custom class - outside coco). Does the transfer learning function work in my case?

When I put in everything and trained a new model, I got a loss = nan. Below is the log. Could you point me to the problem? Thanks!


 57/Unknown - 54s 945ms/step - loss: nan - yolo_output_0_loss: nan - yolo_output_1_loss: nan - yolo_output_2_loss: nan2020-02-21 09:01:43.873721: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
     [[{{node IteratorGetNext}}]]

2020-02-21 09:01:43.873761: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
[[{{node IteratorGetNext}}]]
[[loss/yolo_output_0_loss/Shape_1/_12]]
2020-02-21 09:01:51.458659: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
[[{{node IteratorGetNext}}]]
2020-02-21 09:01:51.459038: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
[[{{node IteratorGetNext}}]]
[[loss/yolo_output_1_loss/Shape_1/_14]]
D:\software\conda_envs\tf2\lib\site-packages\tensorflow_core\python\keras\callbacks.py:1806: RuntimeWarning: invalid value encountered in less
self.monitor_op = lambda a, b: np.less(a, b - self.min_delta)
D:\software\conda_envs\tf2\lib\site-packages\tensorflow_core\python\keras\callbacks.py:1225: RuntimeWarning: invalid value encountered in less
if self.monitor_op(current - self.min_delta, self.best):

Epoch 00001: saving model to checkpoints/yolov3_train_1.tf
57/57 [==============================] - 64s 1s/step - loss: nan - yolo_output_0_loss: nan - yolo_output_1_loss: nan - yolo_output_2_loss: nan - val_loss: nan - val_yolo_output_0_loss: nan - val_yolo_output_1_loss: nan - val_yolo_output_2_loss: nan
Epoch 2/10
1/7 [===>..........................] - ETA: 30s - loss: nan - yolo_output_0_loss: nan - yolo_output_1_loss: nan - yolo_output_2_loss: na2/7 [=======>......................] - ETA: 13s - loss: nan - yolo_output_0_loss: nan - yolo_output_1_loss: nan - yolo_output_2_loss: na3/7 [===========>..................] - ETA: 8s - loss: nan - yolo_output_0_loss: nan - yolo_output_1_loss: nan - yolo_output_2_loss: nan4/7 [================>.............] - ETA: 4s - loss: nan - yolo_output_0_loss: nan - yolo_output_1_loss: nan - yolo_output_2_loss: nan5/7 [====================>.........] - ETA: 2s - loss: nan - yolo_output_0_loss: nan - yolo_output_1_loss: nan - yolo_output_2_loss: nan6/7 [========================>.....] - ETA: 1s - loss: nan - yolo_output_0_loss: nan - yolo_output_1_loss: nan - yolo_output_2_loss: nan2020-02-21 09:02:25.001690: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
[[{{node IteratorGetNext}}]]
2020-02-21 09:02:25.001806: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
[[{{node IteratorGetNext}}]]
[[loss/yolo_output_0_loss/Shape_1/_12]]
2020-02-21 09:02:29.828009: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
[[{{node IteratorGetNext}}]]
2020-02-21 09:02:29.828422: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
[[{{node IteratorGetNext}}]]
[[loss/yolo_output_1_loss/Shape_1/_14]]

Epoch 00002: saving model to checkpoints/yolov3_train_2.tf
57/57 [==============================] - 38s 673ms/step - loss: nan - yolo_output_0_loss: nan - yolo_output_1_loss: nan - yolo_output_2_loss: nan - val_loss: nan - val_yolo_output_0_loss: nan - val_yolo_output_1_loss: nan - val_yolo_output_2_loss: nan
Epoch 3/10
1/7 [===>..........................] - ETA: 30s - loss: nan - yolo_output_0_loss: nan - yolo_output_1_loss: nan - yolo_output_2_loss: na2/7 [=======>......................] - ETA: 14s - loss: nan - yolo_output_0_loss: nan - yolo_output_1_loss: nan - yolo_output_2_loss: na3/7 [===========>..................] - ETA: 8s - loss: nan - yolo_output_0_loss: nan - yolo_output_1_loss: nan - yolo_output_2_loss: nan4/7 [================>.............] - ETA: 5s - loss: nan - yolo_output_0_loss: nan - yolo_output_1_loss: nan - yolo_output_2_loss: nan5/7 [====================>.........] - ETA: 2s - loss: nan - yolo_output_0_loss: nan - yolo_output_1_loss: nan - yolo_output_2_loss: nan6/7 [========================>.....] - ETA: 1s - loss: nan - yolo_output_0_loss: nan - yolo_output_1_loss: nan - yolo_output_2_loss: nan

@chenminni

I encountered this issue as well. It was resolved by providing more training data. You can use some data augmentation to increase your dataset size.
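
For illustration, here is a minimal sketch of one simple augmentation, a random horizontal flip that also mirrors the boxes; the function and the [ymin, xmin, ymax, xmax] box layout are assumptions for illustration, not part of this repo's input pipeline:

import tensorflow as tf

# Sketch (not from this repo): randomly flip an image horizontally and
# mirror its normalized [ymin, xmin, ymax, xmax] boxes to match.
def random_horizontal_flip(image, boxes):
    if tf.random.uniform([]) < 0.5:
        image = tf.image.flip_left_right(image)
        ymin, xmin, ymax, xmax = tf.unstack(boxes, axis=-1)
        boxes = tf.stack([ymin, 1.0 - xmax, ymax, 1.0 - xmin], axis=-1)
    return image, boxes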

@PieroCV

PieroCV commented Feb 24, 2020

Hi!
There are some known issues related to this. If you labeled your dataset with VoTT, some images could be corrupted (especially if you used auto-labeling). The tfrecord files contain a class_name item and an index, and the index must always map to the same class_name. That could be one of the problems.
The second possible problem is the .names file. It may sound obvious, but if you give a wrong .names file to the YOLO trainer, you will get a nan loss.

Finally, the fine-tune training mode expects 80 classes, but providing just one should not be a problem.

If none of these cases applies to you, reply to this issue so I can help.

@jackyvr
Author

jackyvr commented Mar 12, 2020

Thanks for the answers both!

I used 1000 images for transfer learning. Do I still need to increase the number of images? @chenminni, how many images did you use to resolve the problem?

@PieroCV sorry, I don't get your last suggestion "Finally, the fine-tune mode of training get 80 classes, but if you provide just one, there will be no problem." I am training for 1 class. What should I do?

Thanks!

@PieroCV

PieroCV commented Mar 12, 2020

Hi, @jackyvr.
If your tfrecord files are corrupted, the number of images doesn't really matter; train.py won't work.
First, I would like to know how you are getting your data:

  1. Are you using VoTT, LabelImg, or something else to generate your tfrecord files?
  2. Did you modify the repo (I saw binary cross-entropy modifications, but they are not necessary)?
  3. What is the content of your .names file?
  4. Did you pass the parameters correctly when training?
  5. Could you verify the content of one tfrecord file?

For point 5, use this:

import tensorflow as tf

filenames = ["<filename>"]  # replace with the path to one of your tfrecord files
raw_dataset = tf.data.TFRecordDataset(filenames)
for raw_record in raw_dataset.take(1):
    example = tf.train.Example()
    example.ParseFromString(raw_record.numpy())
    print(example)

I hope you can answer as soon as possible so I can help you.

@jackyvr
Author

jackyvr commented Mar 12, 2020

Thanks a lot, @PieroCV!

1. Are you using VoTT, LabelImg, or something else to generate your tfrecord files?
I used OIDv4_ToolKit to generate the tfrecord files. I downloaded one class with bbox info from OID.

2. Did you modify the repo (I saw binary cross-entropy modifications, but they are not necessary)?
I added only printouts for debugging.

3. What is the content of your .names file?
Only one line in my *.names (the name of the class):
glasses

4. Did you pass the parameters correctly when training?
Here is the script that I used for training:
[screenshot of the training command]

5. Could you verify the content of one tfrecord file?
Thanks for the code. I ran it and the output looked OK. Here is the output:
features {
feature {
key: "image/encoded"
value {
bytes_list {
value: "\377\330\377\340\000\020JFIF\000\001\001\001\001,\001,\000\000\377\342\002$ICC_PROFILE\000\001\001\000\000\002\024NKON\002 \000\000mntrRGB XYZ \007\331\000\002\000\024\000\020\0002\000\nacspAPPL\000\000\000\000none\000\000\000\001\000\000\000\000\000\000\000\000\000\000\000\000\000\000\366\326\000\001\000\000\000\000\323-\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\tdesc\000\000\001p\000\000\000vrXYZ\000\000\000\360\000\000\000\024gXYZ\000\000\001\004\000\000\000\024bXYZ\000\000\001\030\000\000\000\024rTRC\000\000\001,\000\000\000\016gTRC\000\000\001<\000\000\000\016bTRC\000\000\001L\000\000\000\016wtpt\000\000\001\\000\000\000\024cprt\000\000\001\350\000\000\000,XYZ \000\000\000\000\000\000\234\031\000\000O\246\000\000\004\374XYZ \000\000\000\000\000\0004\213\000\000\240+\000\000\017\225XYZ \000\000\000\000\000\000&2\000\000\020/\000\000\276\240curv\000\000\000\000\000\000\000\001\0023\000\000curv\000\000\000\000\000\000\000\001\0023\000\000curv\000\000\000\000\000\000\000\001\0023\000\000XYZ \000\000\000\000\000\000\363T\000\001\000\000\000\001\026\317desc\000\000\000\000\000\000\000\033Nikon Adobe RGB 4.0.0.3001\000\000\000\000\000\000\000\000\000\000\000\033Nikon Adobe RGB ...[I removed these values... a lot of them]"
}
}
}
feature {
key: "image/filename"
value {
bytes_list {
value: "00022a6311159428"
}
}
}
feature {
key: "image/format"
value {
bytes_list {
value: "jpg"
}
}
}
feature {
key: "image/height"
value {
int64_list {
value: 819
}
}
}
feature {
key: "image/object/bbox/xmax"
value {
float_list {
value: 0.6768749952316284
}
}
}
feature {
key: "image/object/bbox/xmin"
value {
float_list {
value: 0.5056250095367432
}
}
}
feature {
key: "image/object/bbox/ymax"
value {
float_list {
value: 0.35781198740005493
}
}
}
feature {
key: "image/object/bbox/ymin"
value {
float_list {
value: 0.283594012260437
}
}
}
feature {
key: "image/object/class/label"
value {
int64_list {
value: 1
}
}
}
feature {
key: "image/object/class/text"
value {
bytes_list {
value: "Glasses"
}
}
}
feature {
key: "image/source_id"
value {
bytes_list {
value: "00022a6311159428"
}
}
}
feature {
key: "image/width"
value {
int64_list {
value: 1024
}
}
}
}

@PieroCV

PieroCV commented Mar 12, 2020

The first thing I can see is the upper case in the tfrecord file. Change the .names file to "Glasses". I'm kind of busy right now, but I will check the other answers later.

@jackyvr
Author

jackyvr commented Mar 12, 2020

Thanks @PieroCV . Will do.

@jackyvr
Author

jackyvr commented Mar 12, 2020

@PieroCV, you are amazing! With "glasses" changed to "Glasses" in .names, I am getting a non-nan loss now! Why doesn't lower case work?
Thanks so much!

@jackyvr
Author

jackyvr commented Mar 12, 2020

Oh, I see. My folder name has an upper-case "Glasses". Thanks!

@PieroCV

PieroCV commented Mar 12, 2020

@jackyvr no problem!
It doesn't mean that you can't use lower case in .names; it means that the class text in your tfrecord must match the classes in your .names file.
I managed to make this repo work on my PC, so I had to solve some issues by myself.
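
As a quick sanity check, here is a minimal sketch (not from this repo; the file paths below are placeholders) that scans a tfrecord and reports any image/object/class/text value that is missing from your .names file:

import tensorflow as tf

# Sketch: report any class text in the tfrecord that is missing from .names.
# Both paths below are placeholders for your own files.
names_path = "data/glasses.names"
tfrecord_files = ["data/glasses_train.tfrecord"]

with open(names_path) as f:
    class_names = {line.strip() for line in f if line.strip()}

missing = set()
for raw_record in tf.data.TFRecordDataset(tfrecord_files):
    example = tf.train.Example()
    example.ParseFromString(raw_record.numpy())
    for text in example.features.feature["image/object/class/text"].bytes_list.value:
        name = text.decode("utf-8")
        if name not in class_names:
            missing.add(name)

print("class texts missing from .names:", missing or "none")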
Happy coding!

@jackyvr
Author

jackyvr commented Mar 13, 2020

Thank you @PieroCV. Although I can now train with my own data, the generated model does not pick up any of the objects that I want, even when I feed in a training image. Do I need to tune the hyperparameters? What loss value is a good point to stop at? Sorry, I am new to DL.
