key not found in checkpoint in distributed mode of tensorflow #40

Closed
pan463194277 opened this issue Jul 26, 2017 · 5 comments
@pan463194277

When I run the training function of tf_cnn_benchmarks, everything looks fine and the checkpoint files are successfully stored in train_dir. But when I run the eval function, the following exception occurs.

……
2017-07-26 15:30:52.072950: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key v3/incept_v3_e0/conv84/batchnorm84/moving_variance not found in checkpoint
2017-07-26 15:30:52.073198: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key v3/incept_v3_e0/conv85/batchnorm85/beta not found in checkpoint
2017-07-26 15:30:52.073278: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key v3/incept_v3_e0/conv87/batchnorm87/moving_mean not found in checkpoint
2017-07-26 15:30:52.073406: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key v3/incept_v3_e0/conv85/batchnorm85/moving_mean not found in checkpoint
2017-07-26 15:30:52.073536: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key v3/incept_v3_e0/conv85/conv2d/kernel not found in checkpoint
2017-07-26 15:30:52.073577: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key v3/incept_v3_e0/conv88/batchnorm88/beta not found in checkpoint
2017-07-26 15:30:52.073661: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key v3/incept_v3_e0/conv87/batchnorm87/moving_variance not found in checkpoint
2017-07-26 15:30:52.073738: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key v3/incept_v3_e0/conv85/batchnorm85/moving_variance not found in checkpoint
2017-07-26 15:30:52.073810: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key v3/incept_v3_e0/conv88/conv2d/kernel not found in checkpoint
2017-07-26 15:30:52.073863: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key v3/incept_v3_e0/conv88/batchnorm88/moving_variance not found in checkpoint
2017-07-26 15:30:52.073957: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key v3/incept_v3_e0/conv84/batchnorm84/moving_mean not found in checkpoint
2017-07-26 15:30:52.074055: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key v3/incept_v3_e0/conv87/batchnorm87/gamma not found in checkpoint
2017-07-26 15:30:52.074110: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key v3/incept_v3_e0/conv84/batchnorm84/beta not found in checkpoint
2017-07-26 15:30:52.074348: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key v3/incept_v3_e0/conv87/batchnorm87/beta not found in checkpoint
2017-07-26 15:30:52.074395: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key v3/incept_v3_e0/conv86/batchnorm86/moving_variance not found in checkpoint
2017-07-26 15:30:52.074757: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key v3/incept_v3_e0/conv88/batchnorm88/moving_mean not found in checkpoint
2017-07-26 15:30:52.074770: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key v3/incept_v3_e0/conv88/batchnorm88/gamma not found in checkpoint
2017-07-26 15:30:52.074843: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key v3/incept_v3_e0/conv86/batchnorm86/moving_mean not found in checkpoint

Traceback (most recent call last):
  File "tf_cnn_benchmarks.py", line 1348, in <module>
    tf.app.run()
  File "/usr/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "tf_cnn_benchmarks.py", line 1344, in main
    bench.run()
  File "tf_cnn_benchmarks.py", line 885, in run
    self._eval_cnn()
  File "tf_cnn_benchmarks.py", line 901, in _eval_cnn
    global_step = load_checkpoint(saver, sess, FLAGS.train_dir)
  File "tf_cnn_benchmarks.py", line 717, in load_checkpoint
    saver.restore(sess, model_checkpoint_path)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1457, in restore
    {self.saver_def.filename_tensor_name: save_path})
  File "/usr/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 778, in run
    run_metadata_ptr)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 982, in _run
    feed_dict_string, options, run_metadata)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1032, in _do_run
    target_list, options, run_metadata)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1052, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.NotFoundError: Key v0/incept_v3_d0/conv73/batchnorm73/gamma not found in checkpoint
	 [[Node: save/RestoreV2_369 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/cpu:0"](_recv_save/Const_0, save/RestoreV2_369/tensor_names, save/RestoreV2_369/shape_and_slices)]]
	 [[Node: save/RestoreV2_809/_2199 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/gpu:1", send_device="/job:localhost/replica:0/task:0/cpu:0", send_device_incarnation=1, tensor_name="edge_15876_save/RestoreV2_809", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:1"]()]]

Caused by op u'save/RestoreV2_369', defined at:
  File "tf_cnn_benchmarks.py", line 1348, in <module>
    tf.app.run()
  File "/usr/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "tf_cnn_benchmarks.py", line 1344, in main
    bench.run()
  File "tf_cnn_benchmarks.py", line 885, in run
    self._eval_cnn()
  File "tf_cnn_benchmarks.py", line 892, in _eval_cnn
    saver = tf.train.Saver(tf.global_variables())
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1056, in __init__
    self.build()
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1086, in build
    restore_sequentially=self._restore_sequentially)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 691, in build
    restore_sequentially, reshape)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 407, in _AddRestoreOps
    tensors = self.restore_op(filename_tensor, saveable, preferred_shard)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 247, in restore_op
    [spec.tensor.dtype])[0])
  File "/usr/lib/python2.7/site-packages/tensorflow/python/ops/gen_io_ops.py", line 669, in restore_v2
    dtypes=dtypes, name=name)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 768, in apply_op
    op_def=op_def)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2336, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1228, in __init__
    self._traceback = _extract_stack()

NotFoundError (see above for traceback): Key v0/incept_v3_d0/conv73/batchnorm73/gamma not found in checkpoint
	 [[Node: save/RestoreV2_369 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/cpu:0"](_recv_save/Const_0, save/RestoreV2_369/tensor_names, save/RestoreV2_369/shape_and_slices)]]
	 [[Node: save/RestoreV2_809/_2199 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/gpu:1", send_device="/job:localhost/replica:0/task:0/cpu:0", send_device_incarnation=1, tensor_name="edge_15876_save/RestoreV2_809", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:1"]()]]

My training worker script:

python tf_cnn_benchmarks.py  --train_dir /home/sk/test/train_dir --variable_update distributed_replicated --model inception3 --batch_size 8 --ps_hosts=127.0.0.1:13555 --worker_hosts=127.0.0.1:13600 --job_name=worker --task_index=0 --num_gpus 4  --local_parameter_device cpu

Parameter server script:

python tf_cnn_benchmarks.py  --train_dir /home/sk/test/train_dir --variable_update distributed_replicated --model inception3 --batch_size 8 --ps_hosts=127.0.0.1:13555 --worker_hosts=127.0.0.1:13600 --job_name=ps --task_index=0 --num_gpus 0  --local_parameter_device cpu

Eval script:

python tf_cnn_benchmarks.py  --train_dir /home/sk/test/train_dir --variable_update replicated --model inception3 --batch_size 8 --num_gpus 4 --eval

ll ~/test/train_dir/

total 126348
-rw-rw-r-- 1       143 Jul 26 15:24 checkpoint
-rw-rw-r-- 1       23760967 Jul 26 15:23 graph.pbtxt
-rw-rw-r-- 1       95277612 Jul 26 15:24 model.ckpt-110.data-00000-of-00001
-rw-rw-r-- 1       9461 Jul 26 15:24 model.ckpt-110.index
-rw-rw-r-- 1       10317639 Jul 26 15:24 model.ckpt-110.meta

Besides, when I previously ran training in stand-alone mode (--variable_update replicated), the eval function worked well, so I don't know why it doesn't work in distributed_replicated mode. Can anyone help me? Thanks a lot.
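
(A minimal diagnostic sketch, not part of the original report: listing the variable names actually stored in the checkpoint makes the prefix mismatch visible. It assumes TensorFlow 1.x and reuses the checkpoint path from the directory listing above.)

# Diagnostic sketch (assumption: TF 1.x; path taken from the listing above).
# Print every variable name stored in the checkpoint so it can be compared
# with the names the eval graph tries to restore (e.g. 'v0/incept_v3_d0/...').
import tensorflow as tf

reader = tf.train.NewCheckpointReader("/home/sk/test/train_dir/model.ckpt-110")
for name in sorted(reader.get_variable_to_shape_map()):
    print(name)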

@hangzh2012

I had the same problem. How did you solve it?

@pan463194277
Author

@hangzh2012 I changed some code to adapt it. Here is what I found:
1. Different 'variable_update' modes save model parameters from different variable collections (local, global, trainable), but the 'eval' function only reads from the global collection.
2. Different 'variable_update' modes also store different variable names in the checkpoint. For example, 'replicated' mode saves num_gpus copies such as 'v0/xxx' and 'v1/xxx', while 'distributed_replicated' mode only saves variables whose names start with 'v/' (see the restore sketch after this comment).

Besides, you can read my commit here for reference: pan463194277@3e98be2
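
(A rough, hypothetical sketch of the rename-on-restore idea described in point 2 above; it is not the code from the linked commit. It assumes the checkpoint written by distributed_replicated stores variables under a single 'v/' prefix, while the eval graph names its tower-0 copies 'v0/...'.)

# Illustrative sketch only (assumptions noted above); TF 1.x API.
import tensorflow as tf

def build_restore_saver():
    # Map checkpoint names ('v/...') to the eval graph's tower-0 variables ('v0/...').
    var_map = {}
    for var in tf.global_variables():
        graph_name = var.op.name  # e.g. 'v0/incept_v3_d0/conv73/batchnorm73/gamma'
        if graph_name.startswith('v0/'):
            ckpt_name = 'v/' + graph_name[len('v0/'):]
            var_map[ckpt_name] = var
    return tf.train.Saver(var_map)

# Usage: saver = build_restore_saver()
#        saver.restore(sess, '/home/sk/test/train_dir/model.ckpt-110')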

@reedwm
Member

reedwm commented Jan 2, 2018

I believe we fixed this issue. @hangzh2012 can you try with the latest version of tf_cnn_benchmarks? Note --eval only works in non-distributed mode.

@hangzh2012

@reedwm Yes, it works fine when using --variable_update replicated.

@hangzh2012

@pan463194277 Thank you for your reply. It works well now. I was using the eval method in the wrong way (--variable_update parameter_server).

freedomtan pushed a commit to freedomtan/benchmarks that referenced this issue Apr 18, 2018
Merge internal changes into public repository (change 175579877)