Description
System information
- Have I written custom code (as opposed to using a stock example script provided in TensorFlow): No
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 18.04
- Mobile device (e.g., Pixel 4, Samsung Galaxy 10) if the issue happens on mobile device:
- TensorFlow installed from (source or binary): binary
- TensorFlow version (use command below): 1.13.2
- Python version: 3.6.8
- Bazel version (if compiling from source):
- GCC/Compiler version (if compiling from source):
- CUDA/cuDNN version: 10.0.130
- GPU model and memory: 4 GPU GeForce GTX 1080 Ti 11178MB
Please provide the entire URL of the model you are using?
https://github.com/tensorflow/models/blob/master/research/slim/nets/mobilenet/mobilenet_v2.py
Describe the current behavior
I used the default training script to fine-tune MobileNet-V2 on ImageNet with quantization-aware training, starting from the released floating-point checkpoint (https://storage.googleapis.com/mobilenet_v2/checkpoints/mobilenet_v2_1.0_224.tgz). I specified --num_clones=3 to train on multiple GPUs. Training proceeds as expected, but when I try to evaluate the generated checkpoint, a NotFoundError is raised because of keys missing from the checkpoint file (Key MobilenetV2/Conv/act_quant/max not found in checkpoint). The evaluation is launched with quantization enabled (--quantize flag). Looking at the graph nodes, it seems that when multi-GPU training is enabled the min/max variables get names that the inference graph does not recognize, e.g. clone_0/MobilenetV2/Conv/act_quant/clone_0/MobilenetV2/Conv/act_quant/max/biased. This does not happen with single-GPU training. I suspect the problem is in the create_clones function, which places these variables under a name_scope tied to the clone id.
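To make the naming mismatch easier to see, here is a minimal sketch that compares the keys the evaluation graph asks for with the keys stored in a checkpoint. It works on plain lists of names; in practice the checkpoint names could be read with `tf.train.list_variables(checkpoint_path)`. The helper names `strip_clone_scopes` and `diagnose` are hypothetical, and apart from the two names quoted above, the example keys are made up for illustration.

```python
import re

# Matches a "clone_<n>/" scope segment at the start of a name or right after a "/".
_CLONE_SCOPE = re.compile(r"(?:(?<=/)|^)clone_\d+/")


def strip_clone_scopes(name):
    """Return `name` with every clone_<n>/ scope segment removed."""
    return _CLONE_SCOPE.sub("", name)


def diagnose(expected_keys, checkpoint_keys):
    """Classify each expected key as present, clone-scoped only, or missing."""
    present = set(checkpoint_keys)
    # Index clone-scoped checkpoint keys by the name they would have unscoped.
    scoped = {}
    for key in checkpoint_keys:
        stripped = strip_clone_scopes(key)
        if stripped != key:
            scoped.setdefault(stripped, []).append(key)
    report = {}
    for key in expected_keys:
        if key in present:
            report[key] = ("present", [key])
        elif key in scoped:
            report[key] = ("clone-scoped", sorted(scoped[key]))
        else:
            report[key] = ("missing", [])
    return report


# Hypothetical checkpoint contents; only the last name is taken from the report.
checkpoint_keys = [
    "MobilenetV2/Conv/weights",
    "clone_0/MobilenetV2/Conv/act_quant/min",
    "clone_0/MobilenetV2/Conv/act_quant/clone_0/MobilenetV2/Conv/act_quant/max/biased",
]
expected_keys = [
    "MobilenetV2/Conv/weights",
    "MobilenetV2/Conv/act_quant/min",
    "MobilenetV2/Conv/act_quant/max",
]
for key, (status, candidates) in diagnose(expected_keys, checkpoint_keys).items():
    print(key, status, candidates)
```

A key reported as "clone-scoped" exists in the checkpoint only under a clone prefix, while "missing" means no stored name maps onto it even after the prefix is stripped.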
Describe the expected behavior
I expect that at the end of training, when the checkpoint is saved, the min/max variables from the different clone instances are folded into a single tensor with the same name as in the inference graph.
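As an illustration of what such folding could look like, the sketch below groups clone-scoped entries by their unscoped name and merges the quantization ranges. The merge rule (smallest `min`, largest `max`) is an assumption chosen for the example, not something TensorFlow or slim is known to do. The function name `fold_clone_ranges`, the example keys and the example values are all hypothetical.

```python
import re
from collections import defaultdict

# Leading "clone_<n>/" scope added to names created inside a clone.
_CLONE_PREFIX = re.compile(r"^clone_\d+/")


def fold_clone_ranges(values):
    """Merge per-clone scalars into one entry per unscoped name.

    `values` maps a variable name to a float. Names ending in "/min" keep the
    smallest value across clones, names ending in "/max" keep the largest
    (an assumed rule). Any other name must agree across clones.
    """
    groups = defaultdict(list)
    for name, value in values.items():
        groups[_CLONE_PREFIX.sub("", name)].append(value)

    folded = {}
    for name, vals in groups.items():
        if name.endswith("/min"):
            folded[name] = min(vals)
        elif name.endswith("/max"):
            folded[name] = max(vals)
        elif len(set(vals)) == 1:
            folded[name] = vals[0]
        else:
            raise ValueError("conflicting values for %s: %r" % (name, vals))
    return folded


# Hypothetical per-clone activation ranges.
per_clone = {
    "clone_0/MobilenetV2/Conv/act_quant/min": 0.0,
    "clone_1/MobilenetV2/Conv/act_quant/min": -0.1,
    "clone_0/MobilenetV2/Conv/act_quant/max": 5.8,
    "clone_1/MobilenetV2/Conv/act_quant/max": 6.0,
}
print(fold_clone_ranges(per_clone))
```

The folded dictionary uses the names the inference graph asks for, such as MobilenetV2/Conv/act_quant/max.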
Code to reproduce the issue
CUDA_VISIBLE_DEVICES=1,2,3 python3 train_image_classifier.py --model_name=mobilenet_v2 --dataset_dir=/home/Dataset/imagenet-tf/ --dataset_split_name='train' --num_clones=3 --max_number_of_steps=410000 --quantize_delay=10 --train_dir=mobilenet_v2/imagenet/train_dir_32batch_quant --batch_size=32
CUDA_VISIBLE_DEVICES=1,2,3 python3 eval_image_classifier.py --model_name=mobilenet_v2 --dataset_dir=/home/fariselli/Dataset/imagenet-tf/ --dataset_split_name='validation' --num_clones=3 --quantize --eval_dir=mobilenet_v2/imagenet/train_dir_32batch_quant_quantbeforeclones/eval --checkpoint_path=mobilenet_v2/imagenet/train_dir_32batch_quant
Other info / logs
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Skipping quant after MobilenetV2/Conv/add_fold
INFO:tensorflow:Skipping quant after MobilenetV2/expanded_conv/depthwise/add_fold
INFO:tensorflow:Skipping quant after MobilenetV2/expanded_conv_1/expand/add_fold
INFO:tensorflow:Skipping quant after MobilenetV2/expanded_conv_1/depthwise/add_fold
INFO:tensorflow:Skipping quant after MobilenetV2/expanded_conv_2/expand/add_fold
INFO:tensorflow:Skipping quant after MobilenetV2/expanded_conv_2/depthwise/add_fold
INFO:tensorflow:Skipping quant after MobilenetV2/expanded_conv_3/expand/add_fold
INFO:tensorflow:Skipping quant after MobilenetV2/expanded_conv_3/depthwise/add_fold
INFO:tensorflow:Skipping quant after MobilenetV2/expanded_conv_4/expand/add_fold
INFO:tensorflow:Skipping quant after MobilenetV2/expanded_conv_4/depthwise/add_fold
INFO:tensorflow:Skipping quant after MobilenetV2/expanded_conv_5/expand/add_fold
INFO:tensorflow:Skipping quant after MobilenetV2/expanded_conv_5/depthwise/add_fold
INFO:tensorflow:Skipping quant after MobilenetV2/expanded_conv_6/expand/add_fold
INFO:tensorflow:Skipping quant after MobilenetV2/expanded_conv_6/depthwise/add_fold
INFO:tensorflow:Skipping quant after MobilenetV2/expanded_conv_7/expand/add_fold
INFO:tensorflow:Skipping quant after MobilenetV2/expanded_conv_7/depthwise/add_fold
INFO:tensorflow:Skipping quant after MobilenetV2/expanded_conv_8/expand/add_fold
INFO:tensorflow:Skipping quant after MobilenetV2/expanded_conv_8/depthwise/add_fold
INFO:tensorflow:Skipping quant after MobilenetV2/expanded_conv_9/expand/add_fold
INFO:tensorflow:Skipping quant after MobilenetV2/expanded_conv_9/depthwise/add_fold
INFO:tensorflow:Skipping quant after MobilenetV2/expanded_conv_10/expand/add_fold
INFO:tensorflow:Skipping quant after MobilenetV2/expanded_conv_10/depthwise/add_fold
INFO:tensorflow:Skipping quant after MobilenetV2/expanded_conv_11/expand/add_fold
INFO:tensorflow:Skipping quant after MobilenetV2/expanded_conv_11/depthwise/add_fold
INFO:tensorflow:Skipping quant after MobilenetV2/expanded_conv_12/expand/add_fold
INFO:tensorflow:Skipping quant after MobilenetV2/expanded_conv_12/depthwise/add_fold
INFO:tensorflow:Skipping quant after MobilenetV2/expanded_conv_13/expand/add_fold
INFO:tensorflow:Skipping quant after MobilenetV2/expanded_conv_13/depthwise/add_fold
INFO:tensorflow:Skipping quant after MobilenetV2/expanded_conv_14/expand/add_fold
INFO:tensorflow:Skipping quant after MobilenetV2/expanded_conv_14/depthwise/add_fold
INFO:tensorflow:Skipping quant after MobilenetV2/expanded_conv_15/expand/add_fold
INFO:tensorflow:Skipping quant after MobilenetV2/expanded_conv_15/depthwise/add_fold
INFO:tensorflow:Skipping quant after MobilenetV2/expanded_conv_16/expand/add_fold
INFO:tensorflow:Skipping quant after MobilenetV2/expanded_conv_16/depthwise/add_fold
INFO:tensorflow:Skipping quant after MobilenetV2/Conv_1/add_fold
WARNING:tensorflow:From eval_image_classifier.py:169: streaming_accuracy (from tensorflow.contrib.metrics.python.ops.metric_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.metrics.accuracy. Note that the order of the labels and predictions arguments has been switched.
WARNING:tensorflow:From eval_image_classifier.py:182: Print (from tensorflow.python.ops.logging_ops) is deprecated and will be removed after 2018-08-20.
Instructions for updating:
Use tf.print instead of tf.Print. Note that tf.print returns a no-output operator that directly prints the output. Outside of defuns or eager mode, this operator will not be executed unless it is directly specified in session.run or used as a control dependency for other operators. This is only a concern in graph mode. Below is an example of how to ensure tf.print executes in graph mode:
    sess = tf.Session()
    with sess.as_default():
        tensor = tf.range(10)
        print_op = tf.print(tensor)
        with tf.control_dependencies([print_op]):
            out = tf.add(tensor, tensor)
        sess.run(out)
Additionally, to use tf.print in python 2.7, users must make sure to import
the following:
`from __future__ import print_function`
INFO:tensorflow:Evaluating mobilenet_v2/imagenet/train_dir_32batch_quant/model.ckpt-167061
INFO:tensorflow:Starting evaluation at 2020-04-27T08:24:12Z
INFO:tensorflow:Graph was finalized.
2020-04-27 08:24:12.842443: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-04-27 08:24:13.448974: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x5d6a5b0 executing computations on platform CUDA. Devices:
2020-04-27 08:24:13.449023: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): GeForce GTX 1080 Ti, Compute Capability 6.1
2020-04-27 08:24:13.449035: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (1): GeForce GTX 1080 Ti, Compute Capability 6.1
2020-04-27 08:24:13.449045: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (2): GeForce GTX 1080 Ti, Compute Capability 6.1
2020-04-27 08:24:13.452643: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2394485000 Hz
2020-04-27 08:24:13.456059: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x3908e20 executing computations on platform Host. Devices:
2020-04-27 08:24:13.456135: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): <undefined>, <undefined>
2020-04-27 08:24:13.456402: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:03:00.0
totalMemory: 10.92GiB freeMemory: 10.51GiB
2020-04-27 08:24:13.456501: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 1 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:81:00.0
totalMemory: 10.92GiB freeMemory: 10.09GiB
2020-04-27 08:24:13.456561: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 2 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:82:00.0
totalMemory: 10.92GiB freeMemory: 10.50GiB
2020-04-27 08:24:13.456929: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0, 1, 2
2020-04-27 08:24:13.462417: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-04-27 08:24:13.462463: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0 1 2
2020-04-27 08:24:13.462479: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0: N N N
2020-04-27 08:24:13.462495: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 1: N N Y
2020-04-27 08:24:13.462509: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 2: N Y N
2020-04-27 08:24:13.464485: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10222 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:03:00.0, compute capability: 6.1)
2020-04-27 08:24:13.465307: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 9811 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:81:00.0, compute capability: 6.1)
2020-04-27 08:24:13.466036: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 10219 MB memory) -> physical GPU (device: 2, name: GeForce GTX 1080 Ti, pci bus id: 0000:82:00.0, compute capability: 6.1)
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saver.py:1266: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to check for files with this prefix.
INFO:tensorflow:Restoring parameters from mobilenet_v2/imagenet/train_dir_32batch_quant/model.ckpt-167061
2020-04-27 08:24:14.183966: W tensorflow/core/framework/op_kernel.cc:1401] OP_REQUIRES failed at save_restore_v2_ops.cc:184 : Not found: Key MobilenetV2/Conv/act_quant/max not found in checkpoint
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1334, in _do_call
return fn(*args)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.NotFoundError: Key MobilenetV2/Conv/act_quant/max not found in checkpoint
[[{{node save/RestoreV2}}]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saver.py", line 1276, in restore
{self.saver_def.filename_tensor_name: save_path})
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 929, in run
run_metadata_ptr)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1152, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1328, in _do_run
run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1348, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.NotFoundError: Key MobilenetV2/Conv/act_quant/max not found in checkpoint
[[node save/RestoreV2 (defined at eval_image_classifier.py:205) ]]
Caused by op 'save/RestoreV2', defined at:
File "eval_image_classifier.py", line 211, in <module>
tf.app.run()
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "eval_image_classifier.py", line 205, in main
variables_to_restore=variables_to_restore)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/contrib/slim/python/slim/evaluation.py", line 205, in evaluate_once
saver = tf_saver.Saver(variables_to_restore)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saver.py", line 832, in __init__
self.build()
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saver.py", line 844, in build
self._build(self._filename, build_save=True, build_restore=True)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saver.py", line 881, in _build
build_save=build_save, build_restore=build_restore)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saver.py", line 513, in _build_internal
restore_sequentially, reshape)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saver.py", line 332, in _AddRestoreOps
restore_sequentially)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saver.py", line 580, in bulk_restore
return io_ops.restore_v2(filename_tensor, names, slices, dtypes)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/gen_io_ops.py", line 1572, in restore_v2
name=name)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
op_def=op_def)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 3300, in create_op
op_def=op_def)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 1801, in __init__
self._traceback = tf_stack.extract_stack()
NotFoundError (see above for traceback): Key MobilenetV2/Conv/act_quant/max not found in checkpoint
[[node save/RestoreV2 (defined at eval_image_classifier.py:205) ]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saver.py", line 1286, in restore
names_to_keys = object_graph_key_mapping(save_path)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saver.py", line 1591, in object_graph_key_mapping
checkpointable.OBJECT_GRAPH_PROTO_KEY)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 370, in get_tensor
status)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/errors_impl.py", line 528, in __exit__
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.NotFoundError: Key _CHECKPOINTABLE_OBJECT_GRAPH not found in checkpoint
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "eval_image_classifier.py", line 211, in <module>
tf.app.run()
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "eval_image_classifier.py", line 205, in main
variables_to_restore=variables_to_restore)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/contrib/slim/python/slim/evaluation.py", line 217, in evaluate_once
config=session_config)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/evaluation.py", line 271, in _evaluate_once
session_creator=session_creator, hooks=hooks) as session:
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 934, in __init__
stop_grace_period_secs=stop_grace_period_secs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 648, in __init__
self._sess = _RecoverableSession(self._coordinated_creator)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 1122, in __init__
_WrappedSession.__init__(self, self._create_session())
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 1127, in _create_session
return self._sess_creator.create_session()
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 805, in create_session
self.tf_sess = self._session_creator.create_session()
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 571, in create_session
init_fn=self._scaffold.init_fn)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/session_manager.py", line 281, in prepare_session
config=config)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/session_manager.py", line 195, in _restore_checkpoint
saver.restore(sess, checkpoint_filename_with_path)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saver.py", line 1292, in restore
err, "a Variable name or other graph key that is missing")
tensorflow.python.framework.errors_impl.NotFoundError: Restoring from checkpoint failed. This is most likely due to a Variable name or other graph key that is missing from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:
Key MobilenetV2/Conv/act_quant/max not found in checkpoint
[[node save/RestoreV2 (defined at eval_image_classifier.py:205) ]]
Caused by op 'save/RestoreV2', defined at:
File "eval_image_classifier.py", line 211, in <module>
tf.app.run()
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "eval_image_classifier.py", line 205, in main
variables_to_restore=variables_to_restore)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/contrib/slim/python/slim/evaluation.py", line 205, in evaluate_once
saver = tf_saver.Saver(variables_to_restore)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saver.py", line 832, in __init__
self.build()
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saver.py", line 844, in build
self._build(self._filename, build_save=True, build_restore=True)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saver.py", line 881, in _build
build_save=build_save, build_restore=build_restore)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saver.py", line 513, in _build_internal
restore_sequentially, reshape)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saver.py", line 332, in _AddRestoreOps
restore_sequentially)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saver.py", line 580, in bulk_restore
return io_ops.restore_v2(filename_tensor, names, slices, dtypes)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/gen_io_ops.py", line 1572, in restore_v2
name=name)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
op_def=op_def)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 3300, in create_op
op_def=op_def)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 1801, in __init__
self._traceback = tf_stack.extract_stack()
NotFoundError (see above for traceback): Restoring from checkpoint failed. This is most likely due to a Variable name or other graph key that is missing from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:
Key MobilenetV2/Conv/act_quant/max not found in checkpoint
[[node save/RestoreV2 (defined at eval_image_classifier.py:205) ]]