Questions about transfer learning and training loss = nan #185

Closed
jackyvr opened this issue Feb 21, 2020 · 11 comments

@jackyvr

jackyvr commented Feb 21, 2020

Thanks for the code. I am doing transfer learning with the yolov3 tf2 model using my own dataset (only one custom class - outside coco). Does the transfer learning function work in my case?

When I put in everything and trained a new model, I got a loss = nan. Below is the log. Could you point me to the problem? Thanks!


 57/Unknown - 54s 945ms/step - loss: nan - yolo_output_0_loss: nan - yolo_output_1_loss: nan - yolo_output_2_loss: nan2020-02-21 09:01:43.873721: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
     [[{{node IteratorGetNext}}]]

2020-02-21 09:01:43.873761: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
[[{{node IteratorGetNext}}]]
[[loss/yolo_output_0_loss/Shape_1/_12]]
2020-02-21 09:01:51.458659: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
[[{{node IteratorGetNext}}]]
2020-02-21 09:01:51.459038: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
[[{{node IteratorGetNext}}]]
[[loss/yolo_output_1_loss/Shape_1/_14]]
D:\software\conda_envs\tf2\lib\site-packages\tensorflow_core\python\keras\callbacks.py:1806: RuntimeWarning: invalid value encountered in less
self.monitor_op = lambda a, b: np.less(a, b - self.min_delta)
D:\software\conda_envs\tf2\lib\site-packages\tensorflow_core\python\keras\callbacks.py:1225: RuntimeWarning: invalid value encountered in less
if self.monitor_op(current - self.min_delta, self.best):

Epoch 00001: saving model to checkpoints/yolov3_train_1.tf
57/57 [==============================] - 64s 1s/step - loss: nan - yolo_output_0_loss: nan - yolo_output_1_loss: nan - yolo_output_2_loss: nan - val_loss: nan - val_yolo_output_0_loss: nan - val_yolo_output_1_loss: nan - val_yolo_output_2_loss: nan
Epoch 2/10
1/7 [===>..........................] - ETA: 30s - loss: nan - yolo_output_0_loss: nan - yolo_output_1_loss: nan - yolo_output_2_loss: na2/7 [=======>......................] - ETA: 13s - loss: nan - yolo_output_0_loss: nan - yolo_output_1_loss: nan - yolo_output_2_loss: na3/7 [===========>..................] - ETA: 8s - loss: nan - yolo_output_0_loss: nan - yolo_output_1_loss: nan - yolo_output_2_loss: nan4/7 [================>.............] - ETA: 4s - loss: nan - yolo_output_0_loss: nan - yolo_output_1_loss: nan - yolo_output_2_loss: nan5/7 [====================>.........] - ETA: 2s - loss: nan - yolo_output_0_loss: nan - yolo_output_1_loss: nan - yolo_output_2_loss: nan6/7 [========================>.....] - ETA: 1s - loss: nan - yolo_output_0_loss: nan - yolo_output_1_loss: nan - yolo_output_2_loss: nan2020-02-21 09:02:25.001690: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
[[{{node IteratorGetNext}}]]
2020-02-21 09:02:25.001806: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
[[{{node IteratorGetNext}}]]
[[loss/yolo_output_0_loss/Shape_1/_12]]
2020-02-21 09:02:29.828009: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
[[{{node IteratorGetNext}}]]
2020-02-21 09:02:29.828422: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
[[{{node IteratorGetNext}}]]
[[loss/yolo_output_1_loss/Shape_1/_14]]

Epoch 00002: saving model to checkpoints/yolov3_train_2.tf
57/57 [==============================] - 38s 673ms/step - loss: nan - yolo_output_0_loss: nan - yolo_output_1_loss: nan - yolo_output_2_loss: nan - val_loss: nan - val_yolo_output_0_loss: nan - val_yolo_output_1_loss: nan - val_yolo_output_2_loss: nan
Epoch 3/10
1/7 [===>..........................] - ETA: 30s - loss: nan - yolo_output_0_loss: nan - yolo_output_1_loss: nan - yolo_output_2_loss: na2/7 [=======>......................] - ETA: 14s - loss: nan - yolo_output_0_loss: nan - yolo_output_1_loss: nan - yolo_output_2_loss: na3/7 [===========>..................] - ETA: 8s - loss: nan - yolo_output_0_loss: nan - yolo_output_1_loss: nan - yolo_output_2_loss: nan4/7 [================>.............] - ETA: 5s - loss: nan - yolo_output_0_loss: nan - yolo_output_1_loss: nan - yolo_output_2_loss: nan5/7 [====================>.........] - ETA: 2s - loss: nan - yolo_output_0_loss: nan - yolo_output_1_loss: nan - yolo_output_2_loss: nan6/7 [========================>.....] - ETA: 1s - loss: nan - yolo_output_0_loss: nan - yolo_output_1_loss: nan - yolo_output_2_loss: nan

@chenminni

I encountered this issue as well. It was resolved by providing more training data. You can use some data augmentation to increase your dataset size.
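
For illustration, here is a minimal sketch of one simple augmentation, a random horizontal flip that also mirrors the boxes; the function and the [ymin, xmin, ymax, xmax] box layout are assumptions for illustration, not part of this repo's input pipeline:

import tensorflow as tf

# Sketch (not from this repo): randomly flip an image horizontally and
# mirror its normalized [ymin, xmin, ymax, xmax] boxes to match.
def random_horizontal_flip(image, boxes):
    if tf.random.uniform([]) < 0.5:
        image = tf.image.flip_left_right(image)
        ymin, xmin, ymax, xmax = tf.unstack(boxes, axis=-1)
        boxes = tf.stack([ymin, 1.0 - xmax, ymax, 1.0 - xmin], axis=-1)
    return image, boxes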

@PieroCV

PieroCV commented Feb 24, 2020

Hi!
There are some known issues related to this. If you labeled your dataset with VoTT, some images could be corrupted (especially if you used auto-labeling). The tfrecord files contain a class_name item and an index, and the index must always map to the same class_name. That could be one of the problems.
The second possible problem is the .names file. It may sound obvious, but if you give a wrong .names file to the YOLO trainer, you will get a nan loss.

Finally, the fine-tune training mode expects 80 classes, but providing just one should not be a problem.

If none of these cases applies to you, reply to this issue so I can help.

@jackyvr
Author

jackyvr commented Mar 12, 2020

Thanks for the answers both!

I used 1000 images for transfer learning. Do I still need to increase the number of images? @chenminni, how many images did you use to resolve the problem?

@PieroCV sorry, I don't get your last suggestion "Finally, the fine-tune mode of training get 80 classes, but if you provide just one, there will be no problem." I am training for 1 class. What should I do?

Thanks!

@PieroCV

PieroCV commented Mar 12, 2020

Hi, @jackyvr.
If your tfrecord files are corrupted, the number of images doesn't really matter; train.py won't work.
First, I would like to know how you are getting your data:

  1. Are you using VoTT, LabelImg, or something else to generate your tfrecord files?
  2. Did you modify the repo (I saw binary cross-entropy modifications, but they are not necessary)?
  3. What is the content of your .names file?
  4. Did you pass the parameters correctly when training?
  5. Could you verify the content of one tfrecord file?

For point 5, use this:

import tensorflow as tf

filenames = ["<filename>"]  # replace with the path to one of your tfrecord files
raw_dataset = tf.data.TFRecordDataset(filenames)
for raw_record in raw_dataset.take(1):
    example = tf.train.Example()
    example.ParseFromString(raw_record.numpy())
    print(example)

I hope you can answer as soon as possible so I can help you.

@jackyvr
Author

jackyvr commented Mar 12, 2020

Thanks a lot, @PieroCV!

1. Are you using VoTT, LabelImg, or something else to generate your tfrecord files?
I used OIDv4_ToolKit to generate the tfrecord files. I downloaded one class with bbox info from OID.

2. Did you modify the repo (I saw binary cross-entropy modifications, but they are not necessary)?
I added only printouts for debugging.

3. What is the content of your .names file?
Only one line in my *.names (the name of the class):
glasses

4. Did you pass the parameters correctly when training?
Here is the script that I used for training:
[screenshot of the training command]

5. Could you verify the content of one tfrecord file?
Thanks for the code. I ran it and the output looked OK. Here is the output:
features {
feature {
key: "image/encoded"
value {
bytes_list {
value: "\377\330\377\340\000\020JFIF\000\001\001\001\001,\001,\000\000\377\342\002$ICC_PROFILE\000\001\001\000\000\002\024NKON\002 \000\000mntrRGB XYZ \007\331\000\002\000\024\000\020\0002\000\nacspAPPL\000\000\000\000none\000\000\000\001\000\000\000\000\000\000\000\000\000\000\000\000\000\000\366\326\000\001\000\000\000\000\323-\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\tdesc\000\000\001p\000\000\000vrXYZ\000\000\000\360\000\000\000\024gXYZ\000\000\001\004\000\000\000\024bXYZ\000\000\001\030\000\000\000\024rTRC\000\000\001,\000\000\000\016gTRC\000\000\001<\000\000\000\016bTRC\000\000\001L\000\000\000\016wtpt\000\000\001\\000\000\000\024cprt\000\000\001\350\000\000\000,XYZ \000\000\000\000\000\000\234\031\000\000O\246\000\000\004\374XYZ \000\000\000\000\000\0004\213\000\000\240+\000\000\017\225XYZ \000\000\000\000\000\000&2\000\000\020/\000\000\276\240curv\000\000\000\000\000\000\000\001\0023\000\000curv\000\000\000\000\000\000\000\001\0023\000\000curv\000\000\000\000\000\000\000\001\0023\000\000XYZ \000\000\000\000\000\000\363T\000\001\000\000\000\001\026\317desc\000\000\000\000\000\000\000\033Nikon Adobe RGB 4.0.0.3001\000\000\000\000\000\000\000\000\000\000\000\033Nikon Adobe RGB ...[I removed these values... a lot of them]"
}
}
}
feature {
key: "image/filename"
value {
bytes_list {
value: "00022a6311159428"
}
}
}
feature {
key: "image/format"
value {
bytes_list {
value: "jpg"
}
}
}
feature {
key: "image/height"
value {
int64_list {
value: 819
}
}
}
feature {
key: "image/object/bbox/xmax"
value {
float_list {
value: 0.6768749952316284
}
}
}
feature {
key: "image/object/bbox/xmin"
value {
float_list {
value: 0.5056250095367432
}
}
}
feature {
key: "image/object/bbox/ymax"
value {
float_list {
value: 0.35781198740005493
}
}
}
feature {
key: "image/object/bbox/ymin"
value {
float_list {
value: 0.283594012260437
}
}
}
feature {
key: "image/object/class/label"
value {
int64_list {
value: 1
}
}
}
feature {
key: "image/object/class/text"
value {
bytes_list {
value: "Glasses"
}
}
}
feature {
key: "image/source_id"
value {
bytes_list {
value: "00022a6311159428"
}
}
}
feature {
key: "image/width"
value {
int64_list {
value: 1024
}
}
}
}

@PieroCV

PieroCV commented Mar 12, 2020

The first thing I can see is the upper case in the tfrecord file. Change the .names file to "Glasses". I'm kind of busy right now, but I will check the other answers later.

@jackyvr
Author

jackyvr commented Mar 12, 2020

Thanks @PieroCV . Will do.

@jackyvr
Author

jackyvr commented Mar 12, 2020

@PieroCV, you are amazing! With "glasses" changed to "Glasses" in .names, I am getting a non-nan loss now! Why doesn't lower case work?
Thanks so much!

@jackyvr
Author

jackyvr commented Mar 12, 2020

Oh, I see. My folder name has an upper-case "Glasses". Thanks!

@PieroCV

PieroCV commented Mar 12, 2020

@jackyvr no problem!
It doesn't mean that you can't use lower case in .names; it means that the class text in your tfrecord must match the classes in your .names file.
I managed to make this repo work on my PC, so I had to solve some issues by myself.
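
As a quick sanity check, here is a minimal sketch (not from this repo; the file paths below are placeholders) that scans a tfrecord and reports any image/object/class/text value that is missing from your .names file:

import tensorflow as tf

# Sketch: report any class text in the tfrecord that is missing from .names.
# Both paths below are placeholders for your own files.
names_path = "data/glasses.names"
tfrecord_files = ["data/glasses_train.tfrecord"]

with open(names_path) as f:
    class_names = {line.strip() for line in f if line.strip()}

missing = set()
for raw_record in tf.data.TFRecordDataset(tfrecord_files):
    example = tf.train.Example()
    example.ParseFromString(raw_record.numpy())
    for text in example.features.feature["image/object/class/text"].bytes_list.value:
        name = text.decode("utf-8")
        if name not in class_names:
            missing.add(name)

print("class texts missing from .names:", missing or "none")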
Happy coding!

@jackyvr
Author

jackyvr commented Mar 13, 2020

Thank you @PieroCV. Although I can now train with my own data, the generated model does not pick up any of the objects that I want, even when I feed in a training image. Do I need to tune the hyperparameters? What loss value is a good point to stop at? Sorry, I am new to DL.
