Error when training voc2012 with mask rcnn #3972

Philip-Chen · 2018-04-13T12:08:52Z

The same error occurs on all datasets and with all mask models.

System information

What is the top-level directory of the model you are using: Object Detection
Have I written custom code: No
OS Platform and Distribution: Linux Ubuntu 18.04
TensorFlow installed from: anaconda3
TensorFlow version: 1.6.0
Bazel version (if compiling from source):
CUDA/cuDNN version: 9.0.176/7.0.5
GPU model and memory: GT1030 2GB
Exact command to reproduce:

(tensorflow) philip_chen@Chen-Lenovo:~/TensorFlow/models/research$ CUDA_VISIBLE_DEVICES=1 python object_detection/train.py --logtostderr --pipeline_config_path=/home/philip_chen/TensorFlow/models/research/object_detection/mask_rcnn_inception_v2_coco_2018_01_28/mask_rcnn_inception_v2_coco.config --train_dir=/home/philip_chen/TensorFlow/models/research/object_detection/mask_rcnn_inception_v2_coco_2018_01_28/train

EDIT: (robieta) Moved full output to a separate file
obj_detection_output.txt

/home/philip_chen/anaconda3/envs/tensorflow/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from float to np.floating is deprecated. In future, it will be treated as np.float64 == np.dtype(float).type.
from ._conv import register_converters as _register_converters
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.

...

InvalidArgumentError (see above for traceback): assertion failed: [] [Condition x == y did not hold element-wise:] [x (Loss/BoxClassifierLoss/assert_equal_2/x:0) = ] [0] [y (Loss/BoxClassifierLoss/assert_equal_2/y:0) = ] [2]
[[Node: Loss/BoxClassifierLoss/assert_equal_2/Assert/Assert = Assert[T=[DT_STRING, DT_STRING, DT_STRING, DT_INT32, DT_STRING, DT_INT32], summarize=3, _device="/job:localhost/replica:0/task:0/device:CPU:0"](Loss/BoxClassifierLoss/assert_equal_2/All, Loss/RPNLoss/assert_equal/Assert/Assert/data_0, Loss/RPNLoss/assert_equal/Assert/Assert/data_1, Loss/BoxClassifierLoss/assert_equal_2/Assert/Assert/data_2, Loss/BoxClassifierLoss/assert_equal_2/x, Loss/BoxClassifierLoss/assert_equal_2/Assert/Assert/data_4, Loss/RPNLoss/ones_1/packed)]]

hedeya1980 · 2018-04-15T20:40:37Z

I am facing the same error, and I really need help solving it.

lulu12132017 · 2018-04-16T09:22:36Z

Me too. Has anyone solved it?

robieta · 2018-04-16T16:43:55Z

If you run without the checkpoint, do you still get the assertion errors?

hedeya1980 · 2018-04-20T20:16:34Z

Hi @robieta ,
What do you mean by running without the checkpoint? Do you mean that I should set 'from_detection_checkpoint:' to 'false' in the configuration file?

When I did this, I got other errors.

Could you please clarify?

robieta · 2018-04-20T22:57:03Z

What are the errors that you get when you set from_detection_checkpoint to false?

hedeya1980 · 2018-04-21T19:24:26Z

Hi @robieta,
When I set from_detection_checkpoint to false (mask_rcnn_inception_resnet_v2_atrous_coco), I got the following errors:

EDIT: (robieta) Moved full output to a separate file
obj_detection_output2.txt

C:\Users\hedey\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\ops\gradients_impl.py:97: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
"Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
WARNING:root:Variable [InceptionResnetV2/Block8/Branch_0/Conv2d_1x1/BatchNorm/beta] is not available in checkpoint

...

WARNING:root:Variable [InceptionResnetV2/Repeat_2/block8_9/Conv2d_1x1/weights/Momentum] is not available in checkpoint
Traceback (most recent call last):
  File "train.py", line 167, in <module>
    tf.app.run()
  File "C:\Users\hedey\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\platform\app.py", line 124, in run
    _sys.exit(main(argv))
  File "train.py", line 163, in main
    worker_job_name, is_chief, FLAGS.train_dir)
  File "C:\Users\hedey\models\research\object_detection\trainer.py", line 352, in train
    init_saver = tf.train.Saver(available_var_map)
  File "C:\Users\hedey\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\training\saver.py", line 1239, in __init__
    self.build()
  File "C:\Users\hedey\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\training\saver.py", line 1248, in build
    self._build(self._filename, build_save=True, build_restore=True)
  File "C:\Users\hedey\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\training\saver.py", line 1272, in _build
    raise ValueError("No variables to save")
ValueError: No variables to save

lulu12132017 · 2018-04-22T13:57:17Z

Do not use a checkpoint, like this:

#fine_tune_checkpoint: "PATH_TO_BE_CONFIGURED/model.ckpt"
from_detection_checkpoint: false

You can try that.
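
The two config changes above can also be applied with a small script. This is only a minimal sketch, assuming the pipeline config is plain text with one field per line; `disable_fine_tune_checkpoint` is a hypothetical helper and not part of the Object Detection API.

```python
def disable_fine_tune_checkpoint(config_text: str) -> str:
    """Comment out fine_tune_checkpoint and set from_detection_checkpoint to false.

    Hypothetical helper: it edits the config as text and does not parse
    the protobuf, so it assumes one field per line.
    """
    out = []
    for line in config_text.splitlines():
        stripped = line.lstrip()
        indent = line[: len(line) - len(stripped)]
        if stripped.startswith("fine_tune_checkpoint:"):
            # Keep the original value, but as a comment.
            line = f"{indent}#{stripped}"
        elif stripped.startswith("from_detection_checkpoint:"):
            line = f"{indent}from_detection_checkpoint: false"
        out.append(line)
    return "\n".join(out)


# Illustrative fragment, not a complete pipeline config.
sample = """train_config: {
  fine_tune_checkpoint: "PATH_TO_BE_CONFIGURED/model.ckpt"
  from_detection_checkpoint: true
}"""

print(disable_fine_tune_checkpoint(sample))
```

The script only rewrites the text. Whether training without the checkpoint avoids the assertion depends on what actually causes it.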

hedeya1980 · 2018-04-22T17:37:08Z

Hi @lulu12132017 ,

Now, I get the following errors:

EDIT: (robieta) Moved full output to a separate file
obj_detection_output3.txt

INFO:tensorflow:Error reported to Coordinator: assertion failed: [] [Condition x == y did not hold element-wise:] [x (Loss/BoxClassifierLoss/assert_equal_2/x:0) = ] [0] [y (Loss/BoxClassifierLoss/assert_equal_2/y:0) = ] [1]

...

InvalidArgumentError (see above for traceback): assertion failed: [] [Condition x == y did not hold element-wise:] [x (Loss/BoxClassifierLoss/assert_equal_2/x:0) = ] [0] [y (Loss/BoxClassifierLoss/assert_equal_2/y:0) = ] [1]
[[Node: Loss/BoxClassifierLoss/assert_equal_2/Assert/Assert = Assert[T=[DT_STRING, DT_STRING, DT_STRING, DT_INT32, DT_STRING, DT_INT32], summarize=3, _device="/job:localhost/replica:0/task:0/device:CPU:0"](Loss/BoxClassifierLoss/assert_equal_2/All/_133, Loss/RPNLoss/assert_equal/Assert/Assert/data_0, Loss/RPNLoss/assert_equal/Assert/Assert/data_1, Loss/BoxClassifierLoss/assert_equal_2/Assert/Assert/data_2, Loss/BoxClassifierLoss/assert_equal_2/x/_135, Loss/BoxClassifierLoss/assert_equal_2/Assert/Assert/data_4, Loss/RPNLoss/ones_1/shape/_137)]]
[[Node: FirstStageFeatureExtractor/InceptionResnetV2/Mixed_5b/Branch_3/Conv2d_0b_1x1/BatchNorm/beta/read/_305 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_2367_FirstStageFeatureExtractor/InceptionResnetV2/Mixed_5b/Branch_3/Conv2d_0b_1x1/BatchNorm/beta/read", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]
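
The assertion messages in this thread all report x as 0 and differ only in y (2 in the first report, 1 here). When comparing several runs, it can help to pull those values out of the log line. The following is a rough sketch that assumes the message layout shown above; `parse_assert_equal` is a hypothetical helper, not a TensorFlow function.

```python
import re

# Matches fragments such as "[x (Loss/.../x:0) = ] [0]" in an assert_equal message.
_OPERAND = re.compile(r"\[(x|y) \(([^)]+)\) = \] \[([^\]]*)\]")


def parse_assert_equal(message: str) -> dict:
    """Return {"x": (tensor_name, value), "y": (tensor_name, value)} for one log line.

    The value is kept as the raw string printed in the log.
    """
    return {name: (tensor, value) for name, tensor, value in _OPERAND.findall(message)}


log_line = (
    "InvalidArgumentError (see above for traceback): assertion failed: [] "
    "[Condition x == y did not hold element-wise:] "
    "[x (Loss/BoxClassifierLoss/assert_equal_2/x:0) = ] [0] "
    "[y (Loss/BoxClassifierLoss/assert_equal_2/y:0) = ] [1]"
)

parsed = parse_assert_equal(log_line)
for name in ("x", "y"):
    tensor, value = parsed[name]
    print(f"{name}: {tensor} = {value}")
```

This only extracts what the log already states. It does not explain which input produced the mismatch.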

hedeya1980 · 2018-04-23T19:29:07Z

Hi @lulu12132017 & @robieta,

I really need your help to get a solution for this, because I need to use the tensorflow object detection API in my master's project.

robieta · 2018-04-23T20:17:07Z

I'm going to close this and refer you to the tensorflow StackOverflow, as this appears to be a configuration issue rather than a clear bug in the object detection code.

If you think we've misinterpreted a bug, please comment again with a clear explanation, as well as all of the information requested in the issue template. Thanks!

SarvMangal · 2018-05-07T09:31:05Z

Although robieta closed the issue, the solution isn't available anywhere. There are multiple reports of this problem with no indication of what the configuration issue is or how to actually solve it. Please help.

hedeya1980 · 2018-05-07T20:34:56Z

Hi @SarvMangal,
I agree with you.
We need help finding a real way to solve this.
Even after I followed @robieta's advice and posted at StackOverflow, I haven't received any replies yet.
Here is my Stackoverflow post:
https://stackoverflow.com/questions/50009709/assertion-failed-error-when-using-tensorflow-object-detection-api-to-fine-tune-t

SarvMangal · 2018-05-07T23:55:00Z

Is there any way to reopen this thread? Otherwise I will open one more issue with all the required details. Even if it is a configuration issue, the documentation is just not enough to help us solve the problem.

lulu12132017 · 2018-05-16T15:32:53Z

When you convert the MIO-TCD dataset into TFRecord, you should set the include_masks parameter like this: --include_masks=True. You can try that.
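
This suggestion implies that the records may have been written without mask data. A quick check before training could look like the sketch below. It assumes each record has already been decoded into a plain Python dict keyed by feature name. The key names `image/object/mask` and `image/object/bbox/xmin` are the ones the dataset tools write as far as I know, so treat them as assumptions. `check_masks` is a hypothetical helper.

```python
from typing import Dict

# Assumed feature keys; verify them against your own conversion script.
MASK_KEY = "image/object/mask"
XMIN_KEY = "image/object/bbox/xmin"


def check_masks(features: Dict[str, list]) -> str:
    """Compare the number of boxes with the number of masks in one decoded record."""
    num_boxes = len(features.get(XMIN_KEY, []))
    num_masks = len(features.get(MASK_KEY, []))
    if num_boxes and num_masks == 0:
        return "no masks"
    if num_boxes != num_masks:
        return "count mismatch"
    return "ok"


# Illustrative records, not real data.
with_masks = {XMIN_KEY: [0.1, 0.4], MASK_KEY: [b"png-bytes-1", b"png-bytes-2"]}
boxes_only = {XMIN_KEY: [0.1, 0.4]}

print(check_masks(with_masks))
print(check_masks(boxes_only))
```

A record that has boxes but no masks cannot supply what a mask model expects, so a dataset without mask annotations would need them added before this kind of model can train on it.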

hedeya1980 · 2018-05-21T21:59:41Z

Hi @lulu12132017 ,
Thanks for your reply. However, could you please clarify the following:

Does this require my dataset to have mask data? I'm working on the MIO-TCD dataset and it doesn't have any mask data.
The function that I defined to create a tf_example doesn't have an include_masks parameter, so it isn't clear to me where I should set it.

Abduoit · 2018-05-29T16:54:38Z

I have the same issue.
I created TFRecord files using create_pet_tf_record.py, and now I am trying to train on my dataset with mask_rcnn, but I get the same error. Are there any new suggestions, please?

Abduoit · 2018-05-29T20:23:23Z

@hedeya1980 I could not post my answer to your question on Stack Overflow.

I had this problem and solved it as follows:

The TFRecord files should be named pet_train/val.record. I changed this by editing faces_only from True to False.

Check the line here:
https://github.com/tensorflow/models/blob/master/research/object_detection/dataset_tools/create_pet_tf_record.py#L49

Then I regenerated the TFRecord files with this command:

python object_detection/dataset_tools/create_pet_tf_record.py \
    --label_map_path=object_detection/data/two_label_map.pbtxt \
    --data_dir=`pwd` \
    --output_dir=`pwd` \
    --include_masks=True

This gave me two TFRecord files named pet_train/val.record, which I then used for training with mask_rcnn_inception_v2_coco.

Hope this helps

Abduoit · 2018-06-05T15:33:13Z

I have this issue only when I use TFRecord files generated by create_pascal_tf_record.py. I don't have it when I use TFRecord files generated by create_pet_tf_record.py as I mentioned earlier. Is there any update?

wxianfeng · 2018-06-12T10:25:04Z

When I set faces_only from True to False, the problem was solved.

What does faces_only mean?

erdag · 2018-06-17T04:51:10Z

I am still getting an error related to this issue. Has anybody figured this out yet?

NotFoundError (see above for traceback): Key Conv/biases/Momentum not found in checkpoint
[[Node: save/RestoreV2 = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_INT64], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]

leccyril · 2018-07-13T13:51:35Z

faces_only means that boxes are drawn only on faces, not on the whole body, and no segmentation is done.

tensorflowbutler assigned robieta Apr 13, 2018

robieta mentioned this issue Apr 16, 2018

Unable to load checkpoints for Nasnet? #3938

Closed

robieta closed this as completed Apr 23, 2018

SarvMangal mentioned this issue May 8, 2018

Cannot train the mask-rcnn models #3913

Closed
