Missing min/max nodes in quantization-aware trained model with multi-gpu #8445

@marco-fariselli

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow):
    No
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04):
    Linux Ubuntu 18.04
  • Mobile device (e.g., Pixel 4, Samsung Galaxy 10) if the issue happens on mobile device:
  • TensorFlow installed from (source or binary):
    binary
  • TensorFlow version (use command below):
    1.13.2
  • Python version:
    3.6.8
  • Bazel version (if compiling from source):
  • GCC/Compiler version (if compiling from source):
  • CUDA/cuDNN version:
    10.0.130
  • GPU model and memory:
    4 GPU GeForce GTX 1080 Ti 11178MB

Please provide the entire URL of the model you are using?
https://github.com/tensorflow/models/blob/master/research/slim/nets/mobilenet/mobilenet_v2.py

Describe the current behavior
I used the default training script to fine-tune MobileNet-V2 on ImageNet with quantization-aware training, starting from the released floating-point checkpoint (https://storage.googleapis.com/mobilenet_v2/checkpoints/mobilenet_v2_1.0_224.tgz). I specified --num_clones=3 to use multiple GPUs during training. Training proceeds as expected, but when I try to export the generated model checkpoint for evaluation, the system raises a NotFoundError because of missing nodes in the checkpoint file (Key MobilenetV2/Conv/act_quant/max not found in checkpoint). The evaluation is launched with quantization enabled (--quantize flag). Looking at the graph nodes, it seems that when multi-GPU training is enabled the min/max nodes get names that are not recognized in the inference graph, e.g. clone_0/MobilenetV2/Conv/act_quant/clone_0/MobilenetV2/Conv/act_quant/max/biased. This behavior does not occur with single-GPU training. I suspect the problem is in the create clones function, which places these nodes under a name_scope tied to the clone id.
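
The mismatch described above can be checked by comparing the keys the evaluation graph expects against the keys stored in the checkpoint. The following is a minimal sketch in plain Python, not a fix: `diagnose_checkpoint_keys` is a hypothetical helper, and the key lists are illustrative (the clone-scoped name is the one quoted in this report; the weight key is an assumed ordinary variable).

```python
import re

# A clone scope looks like "clone_0" as a full path segment of a name.
CLONE_SCOPE = re.compile(r"(?:^|/)clone_\d+(?:/|$)")


def diagnose_checkpoint_keys(expected_keys, checkpoint_keys):
    """Return (missing, clone_scoped) for a restore that failed with NotFoundError.

    missing: expected keys absent from the checkpoint, in input order.
    clone_scoped: checkpoint keys containing a clone_<k> scope, sorted.
    """
    available = set(checkpoint_keys)
    missing = [key for key in expected_keys if key not in available]
    clone_scoped = sorted(key for key in available if CLONE_SCOPE.search(key))
    return missing, clone_scoped


# Illustrative names only (assumptions, not read from a real checkpoint).
expected = ["MobilenetV2/Conv/act_quant/max", "MobilenetV2/Conv/weights"]
in_checkpoint = [
    "clone_0/MobilenetV2/Conv/act_quant/clone_0/MobilenetV2/Conv/act_quant/max/biased",
    "MobilenetV2/Conv/weights",
]

missing, clone_scoped = diagnose_checkpoint_keys(expected, in_checkpoint)
print("missing:", missing)
print("clone-scoped:", clone_scoped)
```

In TensorFlow 1.x the real key list could presumably be taken from `tf.train.list_variables(checkpoint_path)`, which returns (name, shape) pairs. Note that simply stripping the `clone_0/` segments from the quoted name would still not yield `MobilenetV2/Conv/act_quant/max`, because the scope is repeated inside the name.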

Describe the expected behavior
I expect that at the end of training, when the checkpoint is saved, the multiple min/max values from the different clone instances are folded into a single tensor with the same name as in the inference graph.

Code to reproduce the issue
CUDA_VISIBLE_DEVICES=1,2,3 python3 train_image_classifier.py --model_name=mobilenet_v2 --dataset_dir=/home/Dataset/imagenet-tf/ --dataset_split_name='train' --num_clones=3 --max_number_of_steps=410000 --quantize_delay=10 --train_dir=mobilenet_v2/imagenet/train_dir_32batch_quant --batch_size=32

CUDA_VISIBLE_DEVICES=1,2,3 python3 eval_image_classifier.py --model_name=mobilenet_v2 --dataset_dir=/home/fariselli/Dataset/imagenet-tf/ --dataset_split_name='validation' --num_clones=3 --quantize --eval_dir=mobilenet_v2/imagenet/train_dir_32batch_quant_quantbeforeclones/eval --checkpoint_path=mobilenet_v2/imagenet/train_dir_32batch_quant

Other info / logs

INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Skipping quant after MobilenetV2/Conv/add_fold
INFO:tensorflow:Skipping quant after MobilenetV2/expanded_conv/depthwise/add_fold
INFO:tensorflow:Skipping quant after MobilenetV2/expanded_conv_1/expand/add_fold
INFO:tensorflow:Skipping quant after MobilenetV2/expanded_conv_1/depthwise/add_fold
INFO:tensorflow:Skipping quant after MobilenetV2/expanded_conv_2/expand/add_fold
INFO:tensorflow:Skipping quant after MobilenetV2/expanded_conv_2/depthwise/add_fold
INFO:tensorflow:Skipping quant after MobilenetV2/expanded_conv_3/expand/add_fold
INFO:tensorflow:Skipping quant after MobilenetV2/expanded_conv_3/depthwise/add_fold
INFO:tensorflow:Skipping quant after MobilenetV2/expanded_conv_4/expand/add_fold
INFO:tensorflow:Skipping quant after MobilenetV2/expanded_conv_4/depthwise/add_fold
INFO:tensorflow:Skipping quant after MobilenetV2/expanded_conv_5/expand/add_fold
INFO:tensorflow:Skipping quant after MobilenetV2/expanded_conv_5/depthwise/add_fold
INFO:tensorflow:Skipping quant after MobilenetV2/expanded_conv_6/expand/add_fold
INFO:tensorflow:Skipping quant after MobilenetV2/expanded_conv_6/depthwise/add_fold
INFO:tensorflow:Skipping quant after MobilenetV2/expanded_conv_7/expand/add_fold
INFO:tensorflow:Skipping quant after MobilenetV2/expanded_conv_7/depthwise/add_fold
INFO:tensorflow:Skipping quant after MobilenetV2/expanded_conv_8/expand/add_fold
INFO:tensorflow:Skipping quant after MobilenetV2/expanded_conv_8/depthwise/add_fold
INFO:tensorflow:Skipping quant after MobilenetV2/expanded_conv_9/expand/add_fold
INFO:tensorflow:Skipping quant after MobilenetV2/expanded_conv_9/depthwise/add_fold
INFO:tensorflow:Skipping quant after MobilenetV2/expanded_conv_10/expand/add_fold
INFO:tensorflow:Skipping quant after MobilenetV2/expanded_conv_10/depthwise/add_fold
INFO:tensorflow:Skipping quant after MobilenetV2/expanded_conv_11/expand/add_fold
INFO:tensorflow:Skipping quant after MobilenetV2/expanded_conv_11/depthwise/add_fold
INFO:tensorflow:Skipping quant after MobilenetV2/expanded_conv_12/expand/add_fold
INFO:tensorflow:Skipping quant after MobilenetV2/expanded_conv_12/depthwise/add_fold
INFO:tensorflow:Skipping quant after MobilenetV2/expanded_conv_13/expand/add_fold
INFO:tensorflow:Skipping quant after MobilenetV2/expanded_conv_13/depthwise/add_fold
INFO:tensorflow:Skipping quant after MobilenetV2/expanded_conv_14/expand/add_fold
INFO:tensorflow:Skipping quant after MobilenetV2/expanded_conv_14/depthwise/add_fold
INFO:tensorflow:Skipping quant after MobilenetV2/expanded_conv_15/expand/add_fold
INFO:tensorflow:Skipping quant after MobilenetV2/expanded_conv_15/depthwise/add_fold
INFO:tensorflow:Skipping quant after MobilenetV2/expanded_conv_16/expand/add_fold
INFO:tensorflow:Skipping quant after MobilenetV2/expanded_conv_16/depthwise/add_fold
INFO:tensorflow:Skipping quant after MobilenetV2/Conv_1/add_fold
WARNING:tensorflow:From eval_image_classifier.py:169: streaming_accuracy (from tensorflow.contrib.metrics.python.ops.metric_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.metrics.accuracy. Note that the order of the labels and predictions arguments has been switched.
WARNING:tensorflow:From eval_image_classifier.py:182: Print (from tensorflow.python.ops.logging_ops) is deprecated and will be removed after 2018-08-20.
Instructions for updating:
Use tf.print instead of tf.Print. Note that tf.print returns a no-output operator that directly prints the output. Outside of defuns or eager mode, this operator will not be executed unless it is directly specified in session.run or used as a control dependency for other operators. This is only a concern in graph mode. Below is an example of how to ensure tf.print executes in graph mode:
python
    sess = tf.Session()
    with sess.as_default():
        tensor = tf.range(10)
        print_op = tf.print(tensor)
        with tf.control_dependencies([print_op]):
          out = tf.add(tensor, tensor)
        sess.run(out)

Additionally, to use tf.print in python 2.7, users must make sure to import
the following:

  `from __future__ import print_function`

INFO:tensorflow:Evaluating mobilenet_v2/imagenet/train_dir_32batch_quant/model.ckpt-167061
INFO:tensorflow:Starting evaluation at 2020-04-27T08:24:12Z
INFO:tensorflow:Graph was finalized.
2020-04-27 08:24:12.842443: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-04-27 08:24:13.448974: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x5d6a5b0 executing computations on platform CUDA. Devices:
2020-04-27 08:24:13.449023: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): GeForce GTX 1080 Ti, Compute Capability 6.1
2020-04-27 08:24:13.449035: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (1): GeForce GTX 1080 Ti, Compute Capability 6.1
2020-04-27 08:24:13.449045: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (2): GeForce GTX 1080 Ti, Compute Capability 6.1
2020-04-27 08:24:13.452643: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2394485000 Hz
2020-04-27 08:24:13.456059: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x3908e20 executing computations on platform Host. Devices:
2020-04-27 08:24:13.456135: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): <undefined>, <undefined>
2020-04-27 08:24:13.456402: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties: 
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:03:00.0
totalMemory: 10.92GiB freeMemory: 10.51GiB
2020-04-27 08:24:13.456501: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 1 with properties: 
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:81:00.0
totalMemory: 10.92GiB freeMemory: 10.09GiB
2020-04-27 08:24:13.456561: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 2 with properties: 
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:82:00.0
totalMemory: 10.92GiB freeMemory: 10.50GiB
2020-04-27 08:24:13.456929: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0, 1, 2
2020-04-27 08:24:13.462417: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-04-27 08:24:13.462463: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      0 1 2 
2020-04-27 08:24:13.462479: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0:   N N N 
2020-04-27 08:24:13.462495: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 1:   N N Y 
2020-04-27 08:24:13.462509: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 2:   N Y N 
2020-04-27 08:24:13.464485: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10222 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:03:00.0, compute capability: 6.1)
2020-04-27 08:24:13.465307: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 9811 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:81:00.0, compute capability: 6.1)
2020-04-27 08:24:13.466036: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 10219 MB memory) -> physical GPU (device: 2, name: GeForce GTX 1080 Ti, pci bus id: 0000:82:00.0, compute capability: 6.1)
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saver.py:1266: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to check for files with this prefix.
INFO:tensorflow:Restoring parameters from mobilenet_v2/imagenet/train_dir_32batch_quant/model.ckpt-167061
2020-04-27 08:24:14.183966: W tensorflow/core/framework/op_kernel.cc:1401] OP_REQUIRES failed at save_restore_v2_ops.cc:184 : Not found: Key MobilenetV2/Conv/act_quant/max not found in checkpoint
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1334, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.NotFoundError: Key MobilenetV2/Conv/act_quant/max not found in checkpoint
	 [[{{node save/RestoreV2}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saver.py", line 1276, in restore
    {self.saver_def.filename_tensor_name: save_path})
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 929, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1152, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1328, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1348, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.NotFoundError: Key MobilenetV2/Conv/act_quant/max not found in checkpoint
	 [[node save/RestoreV2 (defined at eval_image_classifier.py:205) ]]

Caused by op 'save/RestoreV2', defined at:
  File "eval_image_classifier.py", line 211, in <module>
    tf.app.run()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "eval_image_classifier.py", line 205, in main
    variables_to_restore=variables_to_restore)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/contrib/slim/python/slim/evaluation.py", line 205, in evaluate_once
    saver = tf_saver.Saver(variables_to_restore)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saver.py", line 832, in __init__
    self.build()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saver.py", line 844, in build
    self._build(self._filename, build_save=True, build_restore=True)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saver.py", line 881, in _build
    build_save=build_save, build_restore=build_restore)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saver.py", line 513, in _build_internal
    restore_sequentially, reshape)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saver.py", line 332, in _AddRestoreOps
    restore_sequentially)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saver.py", line 580, in bulk_restore
    return io_ops.restore_v2(filename_tensor, names, slices, dtypes)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/gen_io_ops.py", line 1572, in restore_v2
    name=name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 3300, in create_op
    op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 1801, in __init__
    self._traceback = tf_stack.extract_stack()

NotFoundError (see above for traceback): Key MobilenetV2/Conv/act_quant/max not found in checkpoint
	 [[node save/RestoreV2 (defined at eval_image_classifier.py:205) ]]


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saver.py", line 1286, in restore
    names_to_keys = object_graph_key_mapping(save_path)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saver.py", line 1591, in object_graph_key_mapping
    checkpointable.OBJECT_GRAPH_PROTO_KEY)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 370, in get_tensor
    status)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/errors_impl.py", line 528, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.NotFoundError: Key _CHECKPOINTABLE_OBJECT_GRAPH not found in checkpoint

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "eval_image_classifier.py", line 211, in <module>
    tf.app.run()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "eval_image_classifier.py", line 205, in main
    variables_to_restore=variables_to_restore)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/contrib/slim/python/slim/evaluation.py", line 217, in evaluate_once
    config=session_config)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/evaluation.py", line 271, in _evaluate_once
    session_creator=session_creator, hooks=hooks) as session:
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 934, in __init__
    stop_grace_period_secs=stop_grace_period_secs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 648, in __init__
    self._sess = _RecoverableSession(self._coordinated_creator)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 1122, in __init__
    _WrappedSession.__init__(self, self._create_session())
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 1127, in _create_session
    return self._sess_creator.create_session()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 805, in create_session
    self.tf_sess = self._session_creator.create_session()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 571, in create_session
    init_fn=self._scaffold.init_fn)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/session_manager.py", line 281, in prepare_session
    config=config)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/session_manager.py", line 195, in _restore_checkpoint
    saver.restore(sess, checkpoint_filename_with_path)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saver.py", line 1292, in restore
    err, "a Variable name or other graph key that is missing")
tensorflow.python.framework.errors_impl.NotFoundError: Restoring from checkpoint failed. This is most likely due to a Variable name or other graph key that is missing from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:

Key MobilenetV2/Conv/act_quant/max not found in checkpoint
	 [[node save/RestoreV2 (defined at eval_image_classifier.py:205) ]]

Caused by op 'save/RestoreV2', defined at:
  File "eval_image_classifier.py", line 211, in <module>
    tf.app.run()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "eval_image_classifier.py", line 205, in main
    variables_to_restore=variables_to_restore)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/contrib/slim/python/slim/evaluation.py", line 205, in evaluate_once
    saver = tf_saver.Saver(variables_to_restore)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saver.py", line 832, in __init__
    self.build()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saver.py", line 844, in build
    self._build(self._filename, build_save=True, build_restore=True)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saver.py", line 881, in _build
    build_save=build_save, build_restore=build_restore)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saver.py", line 513, in _build_internal
    restore_sequentially, reshape)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saver.py", line 332, in _AddRestoreOps
    restore_sequentially)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saver.py", line 580, in bulk_restore
    return io_ops.restore_v2(filename_tensor, names, slices, dtypes)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/gen_io_ops.py", line 1572, in restore_v2
    name=name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 3300, in create_op
    op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 1801, in __init__
    self._traceback = tf_stack.extract_stack()

NotFoundError (see above for traceback): Restoring from checkpoint failed. This is most likely due to a Variable name or other graph key that is missing from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:

Key MobilenetV2/Conv/act_quant/max not found in checkpoint
	 [[node save/RestoreV2 (defined at eval_image_classifier.py:205) ]]

Labels

models:research (models that come under research directory), type:bug (Bug in the code)