Error when training voc2012 with mask rcnn #3972

Closed
Philip-Chen opened this issue Apr 13, 2018 · 21 comments

@Philip-Chen

Philip-Chen commented Apr 13, 2018

I get the same error with every dataset and every Mask R-CNN model.

System information

  • What is the top-level directory of the model you are using: Object Detection
  • Have I written custom code: No
  • OS Platform and Distribution: Linux Ubuntu 18.04
  • TensorFlow installed from: anaconda3
  • TensorFlow version: 1.6.0
  • Bazel version (if compiling from source):
  • CUDA/cuDNN version: 9.0.176/7.0.5
  • GPU model and memory: GT1030 2GB
  • Exact command to reproduce:

(tensorflow) philip_chen@Chen-Lenovo:~/TensorFlow/models/research$ CUDA_VISIBLE_DEVICES=1 python object_detection/train.py --logtostderr --pipeline_config_path=/home/philip_chen/TensorFlow/models/research/object_detection/mask_rcnn_inception_v2_coco_2018_01_28/mask_rcnn_inception_v2_coco.config --train_dir=/home/philip_chen/TensorFlow/models/research/object_detection/mask_rcnn_inception_v2_coco_2018_01_28/train

EDIT: (robieta) Moved full output to a separate file
obj_detection_output.txt

/home/philip_chen/anaconda3/envs/tensorflow/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from float to np.floating is deprecated. In future, it will be treated as np.float64 == np.dtype(float).type.
from ._conv import register_converters as _register_converters
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.

...

InvalidArgumentError (see above for traceback): assertion failed: [] [Condition x == y did not hold element-wise:] [x (Loss/BoxClassifierLoss/assert_equal_2/x:0) = ] [0] [y (Loss/BoxClassifierLoss/assert_equal_2/y:0) = ] [2]
[[Node: Loss/BoxClassifierLoss/assert_equal_2/Assert/Assert = Assert[T=[DT_STRING, DT_STRING, DT_STRING, DT_INT32, DT_STRING, DT_INT32], summarize=3, _device="/job:localhost/replica:0/task:0/device:CPU:0"](Loss/BoxClassifierLoss/assert_equal_2/All, Loss/RPNLoss/assert_equal/Assert/Assert/data_0, Loss/RPNLoss/assert_equal/Assert/Assert/data_1, Loss/BoxClassifierLoss/assert_equal_2/Assert/Assert/data_2, Loss/BoxClassifierLoss/assert_equal_2/x, Loss/BoxClassifierLoss/assert_equal_2/Assert/Assert/data_4, Loss/RPNLoss/ones_1/packed)]]

@hedeya1980

I'm facing the same error and would really appreciate help solving it.

@lulu12132017

Me too. Has anyone solved it?

@robieta
Contributor

robieta commented Apr 16, 2018

If you run without the checkpoint, do you still get the assertion errors?

@hedeya1980

Hi @robieta ,
What do you mean by running without the checkpoint? Do you mean that I should set 'from_detection_checkpoint:' to 'false' in the configuration file?

When I did this, I got other errors.

Could you please clarify?

@robieta
Contributor

robieta commented Apr 20, 2018

What are the errors that you get when you set from_detection_checkpoint to false?

@hedeya1980

hedeya1980 commented Apr 21, 2018

Hi @robieta,
When I set from_detection_checkpoint to false (mask_rcnn_inception_resnet_v2_atrous_coco), I got the following errors:

EDIT: (robieta) Moved full output to a separate file
obj_detection_output2.txt

C:\Users\hedey\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\ops\gradients_impl.py:97: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
"Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
WARNING:root:Variable [InceptionResnetV2/Block8/Branch_0/Conv2d_1x1/BatchNorm/beta] is not available in checkpoint

...

WARNING:root:Variable [InceptionResnetV2/Repeat_2/block8_9/Conv2d_1x1/weights/Momentum] is not available in checkpoint
Traceback (most recent call last):
  File "train.py", line 167, in <module>
    tf.app.run()
  File "C:\Users\hedey\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\platform\app.py", line 124, in run
    _sys.exit(main(argv))
  File "train.py", line 163, in main
    worker_job_name, is_chief, FLAGS.train_dir)
  File "C:\Users\hedey\models\research\object_detection\trainer.py", line 352, in train
    init_saver = tf.train.Saver(available_var_map)
  File "C:\Users\hedey\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\training\saver.py", line 1239, in __init__
    self.build()
  File "C:\Users\hedey\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\training\saver.py", line 1248, in build
    self._build(self._filename, build_save=True, build_restore=True)
  File "C:\Users\hedey\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\training\saver.py", line 1272, in _build
    raise ValueError("No variables to save")
ValueError: No variables to save
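
One possible reading of the output above: every variable was reported as "not available in checkpoint", so the map handed to tf.train.Saver ended up empty, and the Saver refused to build. The plain-Python sketch below only imitates that filtering step; it is not the library's code. The function name restorable_names is hypothetical, and the idea that the mismatch comes from a scope prefix on the checkpoint side is an assumption, not something confirmed in this thread.

```python
def restorable_names(lookup_names, checkpoint_names):
    """Return the lookup names that also exist in the checkpoint."""
    available = set(checkpoint_names)
    return [name for name in lookup_names if name in available]


# Illustrative names only. The lookup name has no scope prefix, as in the
# warnings above; the checkpoint name is ASSUMED to carry a scope prefix.
lookup_names = [
    "InceptionResnetV2/Mixed_5b/Branch_3/Conv2d_0b_1x1/BatchNorm/beta",
]
checkpoint_names = [
    "FirstStageFeatureExtractor/InceptionResnetV2/Mixed_5b/Branch_3/"
    "Conv2d_0b_1x1/BatchNorm/beta",
]

matched = restorable_names(lookup_names, checkpoint_names)
print(matched)  # [] because no name matches exactly
if not matched:
    print("Nothing to restore: a Saver built from this map has no variables.")
```

If the names in your own checkpoint do match, this explanation does not apply and the cause lies elsewhere.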

@lulu12132017

Do not use a checkpoint. Like this:

#fine_tune_checkpoint: "PATH_TO_BE_CONFIGURED/model.ckpt"
from_detection_checkpoint: false

You can try that.
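
For anyone who wants to script this edit rather than make it by hand, here is a minimal sketch that works on the config as plain text. The function name disable_fine_tune_checkpoint and the sample config are hypothetical, and the sketch assumes each of the two fields sits on its own line, as in the snippet above.

```python
def disable_fine_tune_checkpoint(config_text):
    """Comment out fine_tune_checkpoint and set from_detection_checkpoint to false."""
    edited = []
    for line in config_text.splitlines():
        stripped = line.lstrip()
        indent = line[: len(line) - len(stripped)]
        if stripped.startswith("fine_tune_checkpoint:"):
            # Keep the original value visible, but commented out.
            line = indent + "#" + stripped
        elif stripped.startswith("from_detection_checkpoint:"):
            line = indent + "from_detection_checkpoint: false"
        edited.append(line)
    return "\n".join(edited)


# Hypothetical fragment of a pipeline config, for illustration only.
sample = """train_config: {
  fine_tune_checkpoint: "PATH_TO_BE_CONFIGURED/model.ckpt"
  from_detection_checkpoint: true
}"""

print(disable_fine_tune_checkpoint(sample))
```

Note that this only removes the checkpoint from the picture; as the next comment shows, the assertion error can still appear afterwards.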

@hedeya1980

hedeya1980 commented Apr 22, 2018

Hi @lulu12132017 ,

Now, I get the following errors:

EDIT: (robieta) Moved full output to a separate file
obj_detection_output3.txt

INFO:tensorflow:Error reported to Coordinator: assertion failed: [] [Condition x == y did not hold element-wise:] [x (Loss/BoxClassifierLoss/assert_equal_2/x:0) = ] [0] [y (Loss/BoxClassifierLoss/assert_equal_2/y:0) = ] [1]

...

InvalidArgumentError (see above for traceback): assertion failed: [] [Condition x == y did not hold element-wise:] [x (Loss/BoxClassifierLoss/assert_equal_2/x:0) = ] [0] [y (Loss/BoxClassifierLoss/assert_equal_2/y:0) = ] [1]
[[Node: Loss/BoxClassifierLoss/assert_equal_2/Assert/Assert = Assert[T=[DT_STRING, DT_STRING, DT_STRING, DT_INT32, DT_STRING, DT_INT32], summarize=3, _device="/job:localhost/replica:0/task:0/device:CPU:0"](Loss/BoxClassifierLoss/assert_equal_2/All/_133, Loss/RPNLoss/assert_equal/Assert/Assert/data_0, Loss/RPNLoss/assert_equal/Assert/Assert/data_1, Loss/BoxClassifierLoss/assert_equal_2/Assert/Assert/data_2, Loss/BoxClassifierLoss/assert_equal_2/x/_135, Loss/BoxClassifierLoss/assert_equal_2/Assert/Assert/data_4, Loss/RPNLoss/ones_1/shape/_137)]]
[[Node: FirstStageFeatureExtractor/InceptionResnetV2/Mixed_5b/Branch_3/Conv2d_0b_1x1/BatchNorm/beta/read/_305 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_2367_FirstStageFeatureExtractor/InceptionResnetV2/Mixed_5b/Branch_3/Conv2d_0b_1x1/BatchNorm/beta/read", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]

@hedeya1980

Hi @lulu12132017 & @robieta,

I really need your help to get a solution for this, because I need to use the tensorflow object detection API in my master's project.

@robieta
Contributor

robieta commented Apr 23, 2018

I'm going to close this and refer you to the tensorflow StackOverflow, as this appears to be a configuration issue rather than a clear bug in the object detection code.

If you think we've misinterpreted a bug, please comment again with a clear explanation, as well as all of the information requested in the issue template. Thanks!

@robieta robieta closed this as completed Apr 23, 2018
@SarvMangal

Although robieta closed the issue, the solution isn't available anywhere. There are multiple issues reporting this error, with no indication of what the configuration problem is or how to actually solve it. Please help.

@hedeya1980

Hi @SarvMangal,
I agree with you.
We need a real way of solving this.
Even after I followed @robieta's advice and posted on StackOverflow, I haven't received any replies yet.
Here is my StackOverflow post:
https://stackoverflow.com/questions/50009709/assertion-failed-error-when-using-tensorflow-object-detection-api-to-fine-tune-t

@SarvMangal

SarvMangal commented May 7, 2018 via email

@lulu12132017

lulu12132017 commented May 16, 2018 via email

@hedeya1980

Hi @lulu12132017 ,
Thanks for your reply. However, could you please clarify the following:

  • Does this require my dataset to have mask data? I'm working on the MIO-TCD dataset, and it doesn't have any mask data.

  • The function that I defined to create a tf_example doesn't have an include_masks parameter, so I'm not clear about where I should set include_masks.

@Abduoit

Abduoit commented May 29, 2018

I have the same issue.
I created TFRecord files using create_pet_tf_record.py, and now I am trying to train on my dataset with mask_rcnn, but I am getting the same error. Are there any new suggestions?

@Abduoit

Abduoit commented May 29, 2018

@hedeya1980 I could not post an answer to your question on StackOverflow.

I had this problem and solved it as follows:

The TFRecord files should be named pet_train/val.record. I got those names by changing faces_only from True to False.

Check the line here:
https://github.com/tensorflow/models/blob/master/research/object_detection/dataset_tools/create_pet_tf_record.py#L49

Then I regenerated the TFRecord files with this command:

python object_detection/dataset_tools/create_pet_tf_record.py \
    --label_map_path=object_detection/data/two_label_map.pbtxt \
    --data_dir=`pwd` \
    --output_dir=`pwd` \
    --include_masks=True

That gave me two TFRecord files named pet_train/val.record, which I then used for training with mask_rcnn_inception_v2_coco.

Hope this helps.

@Abduoit

Abduoit commented Jun 5, 2018

I have this issue only when I use TFRecord files generated by create_pascal_tf_record.py. I don't have it when I use TFRecord files generated by create_pet_tf_record.py as I mentioned earlier. Is there any update?
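
One explanation consistent with this observation, though not confirmed anywhere in the thread, is that the failing records carry boxes but no instance masks. The sketch below counts records that lack a mask feature. It assumes the TF 1.x reader tf.python_io.tf_record_iterator, and it assumes masks are stored under the feature key "image/object/mask"; verify that key against the script that wrote your records. The function names are hypothetical.

```python
# ASSUMPTION: instance masks are stored under this feature key.
MASK_KEY = "image/object/mask"


def has_masks(feature_keys):
    """True if the collection of feature keys includes the mask key."""
    return MASK_KEY in feature_keys


def count_records_without_masks(record_path):
    """Count tf.Example records in a TFRecord file that have no mask feature."""
    import tensorflow as tf  # imported lazily; TF 1.x API assumed

    missing = 0
    for raw in tf.python_io.tf_record_iterator(record_path):
        example = tf.train.Example.FromString(raw)
        if not has_masks(example.features.feature):
            missing += 1
    return missing


# Usage (the path is a placeholder):
# print(count_records_without_masks("path/to/train.record"))
```

A non-zero count would suggest regenerating the records with masks included before training a Mask R-CNN model.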

@wxianfeng

When I changed faces_only from True to False, it was solved.

What does faces_only mean?

@erdag

erdag commented Jun 17, 2018

I am still getting an error related to this issue. Has anybody figured this out yet?

NotFoundError (see above for traceback): Key Conv/biases/Momentum not found in checkpoint
[[Node: save/RestoreV2 = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_INT64], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]
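
A "Key ... not found in checkpoint" error can be narrowed down by comparing the names the graph wants to restore with the names the checkpoint actually holds. The sketch below separates missing optimizer slots (names ending in /Momentum, as in the error above) from missing model weights. The helper names are hypothetical and the example names are illustrative; tf.train.list_variables is assumed to be available in your TensorFlow version.

```python
def split_missing_keys(graph_names, checkpoint_names, slot_suffix="/Momentum"):
    """Return (missing_slots, missing_weights) for names absent from the checkpoint."""
    available = set(checkpoint_names)
    missing = [name for name in graph_names if name not in available]
    slots = [name for name in missing if name.endswith(slot_suffix)]
    weights = [name for name in missing if not name.endswith(slot_suffix)]
    return slots, weights


def list_checkpoint_names(checkpoint_path):
    """List variable names stored in a checkpoint (needs TensorFlow installed)."""
    import tensorflow as tf  # imported lazily

    return [name for name, _shape in tf.train.list_variables(checkpoint_path)]


# Illustrative names only, modelled on the key in the error message.
graph_names = ["Conv/biases", "Conv/biases/Momentum"]
checkpoint_names = ["Conv/biases"]

slots, weights = split_missing_keys(graph_names, checkpoint_names)
print(slots)    # ['Conv/biases/Momentum']
print(weights)  # []
```

If only slot variables are missing, the checkpoint was probably saved without optimizer state; if weights are missing too, the checkpoint and the model likely do not match.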

@leccyril

faces_only means the boxes cover only the faces, not the whole body, and no segmentation is produced.

9 participants