key not found in checkpoint in distributed mode of tensorflow #40

Closed
pan463194277 opened this issue Jul 26, 2017 · 5 comments
@pan463194277

When I run the training function of tf_cnn_benchmarks, everything looks fine and the checkpoint files are successfully stored in train_dir. But when I run the eval function, the following exception occurs.

……
2017-07-26 15:30:52.072950: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key v3/incept_v3_e0/conv84/batchnorm84/moving_variance not found in checkpoint
2017-07-26 15:30:52.073198: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key v3/incept_v3_e0/conv85/batchnorm85/beta not found in checkpoint
2017-07-26 15:30:52.073278: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key v3/incept_v3_e0/conv87/batchnorm87/moving_mean not found in checkpoint
2017-07-26 15:30:52.073406: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key v3/incept_v3_e0/conv85/batchnorm85/moving_mean not found in checkpoint
2017-07-26 15:30:52.073536: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key v3/incept_v3_e0/conv85/conv2d/kernel not found in checkpoint
2017-07-26 15:30:52.073577: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key v3/incept_v3_e0/conv88/batchnorm88/beta not found in checkpoint
2017-07-26 15:30:52.073661: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key v3/incept_v3_e0/conv87/batchnorm87/moving_variance not found in checkpoint
2017-07-26 15:30:52.073738: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key v3/incept_v3_e0/conv85/batchnorm85/moving_variance not found in checkpoint
2017-07-26 15:30:52.073810: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key v3/incept_v3_e0/conv88/conv2d/kernel not found in checkpoint
2017-07-26 15:30:52.073863: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key v3/incept_v3_e0/conv88/batchnorm88/moving_variance not found in checkpoint
2017-07-26 15:30:52.073957: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key v3/incept_v3_e0/conv84/batchnorm84/moving_mean not found in checkpoint
2017-07-26 15:30:52.074055: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key v3/incept_v3_e0/conv87/batchnorm87/gamma not found in checkpoint
2017-07-26 15:30:52.074110: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key v3/incept_v3_e0/conv84/batchnorm84/beta not found in checkpoint
2017-07-26 15:30:52.074348: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key v3/incept_v3_e0/conv87/batchnorm87/beta not found in checkpoint
2017-07-26 15:30:52.074395: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key v3/incept_v3_e0/conv86/batchnorm86/moving_variance not found in checkpoint
2017-07-26 15:30:52.074757: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key v3/incept_v3_e0/conv88/batchnorm88/moving_mean not found in checkpoint
2017-07-26 15:30:52.074770: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key v3/incept_v3_e0/conv88/batchnorm88/gamma not found in checkpoint
2017-07-26 15:30:52.074843: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key v3/incept_v3_e0/conv86/batchnorm86/moving_mean not found in checkpoint

Traceback (most recent call last):
  File "tf_cnn_benchmarks.py", line 1348, in <module>
    tf.app.run()
  File "/usr/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "tf_cnn_benchmarks.py", line 1344, in main
    bench.run()
  File "tf_cnn_benchmarks.py", line 885, in run
    self._eval_cnn()
  File "tf_cnn_benchmarks.py", line 901, in _eval_cnn
    global_step = load_checkpoint(saver, sess, FLAGS.train_dir)
  File "tf_cnn_benchmarks.py", line 717, in load_checkpoint
    saver.restore(sess, model_checkpoint_path)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1457, in restore
    {self.saver_def.filename_tensor_name: save_path})
  File "/usr/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 778, in run
    run_metadata_ptr)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 982, in _run
    feed_dict_string, options, run_metadata)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1032, in _do_run
    target_list, options, run_metadata)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1052, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.NotFoundError: Key v0/incept_v3_d0/conv73/batchnorm73/gamma not found in checkpoint
	 [[Node: save/RestoreV2_369 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/cpu:0"](_recv_save/Const_0, save/RestoreV2_369/tensor_names, save/RestoreV2_369/shape_and_slices)]]
	 [[Node: save/RestoreV2_809/_2199 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/gpu:1", send_device="/job:localhost/replica:0/task:0/cpu:0", send_device_incarnation=1, tensor_name="edge_15876_save/RestoreV2_809", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:1"]()]]

Caused by op u'save/RestoreV2_369', defined at:
  File "tf_cnn_benchmarks.py", line 1348, in <module>
    tf.app.run()
  File "/usr/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "tf_cnn_benchmarks.py", line 1344, in main
    bench.run()
  File "tf_cnn_benchmarks.py", line 885, in run
    self._eval_cnn()
  File "tf_cnn_benchmarks.py", line 892, in _eval_cnn
    saver = tf.train.Saver(tf.global_variables())
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1056, in __init__
    self.build()
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1086, in build
    restore_sequentially=self._restore_sequentially)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 691, in build
    restore_sequentially, reshape)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 407, in _AddRestoreOps
    tensors = self.restore_op(filename_tensor, saveable, preferred_shard)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 247, in restore_op
    [spec.tensor.dtype])[0])
  File "/usr/lib/python2.7/site-packages/tensorflow/python/ops/gen_io_ops.py", line 669, in restore_v2
    dtypes=dtypes, name=name)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 768, in apply_op
    op_def=op_def)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2336, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1228, in __init__
    self._traceback = _extract_stack()

NotFoundError (see above for traceback): Key v0/incept_v3_d0/conv73/batchnorm73/gamma not found in checkpoint
	 [[Node: save/RestoreV2_369 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/cpu:0"](_recv_save/Const_0, save/RestoreV2_369/tensor_names, save/RestoreV2_369/shape_and_slices)]]
	 [[Node: save/RestoreV2_809/_2199 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/gpu:1", send_device="/job:localhost/replica:0/task:0/cpu:0", send_device_incarnation=1, tensor_name="edge_15876_save/RestoreV2_809", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:1"]()]]

My training worker script:

python tf_cnn_benchmarks.py  --train_dir /home/sk/test/train_dir --variable_update distributed_replicated --model inception3 --batch_size 8 --ps_hosts=127.0.0.1:13555 --worker_hosts=127.0.0.1:13600 --job_name=worker --task_index=0 --num_gpus 4  --local_parameter_device cpu

Parameter server script:

python tf_cnn_benchmarks.py  --train_dir /home/sk/test/train_dir --variable_update distributed_replicated --model inception3 --batch_size 8 --ps_hosts=127.0.0.1:13555 --worker_hosts=127.0.0.1:13600 --job_name=ps --task_index=0 --num_gpus 0  --local_parameter_device cpu

Eval script:

python tf_cnn_benchmarks.py  --train_dir /home/sk/test/train_dir --variable_update replicated --model inception3 --batch_size 8 --num_gpus 4 --eval

ll ~/test/train_dir/

total 126348
-rw-rw-r-- 1       143 Jul 26 15:24 checkpoint
-rw-rw-r-- 1       23760967 Jul 26 15:23 graph.pbtxt
-rw-rw-r-- 1       95277612 Jul 26 15:24 model.ckpt-110.data-00000-of-00001
-rw-rw-r-- 1       9461 Jul 26 15:24 model.ckpt-110.index
-rw-rw-r-- 1       10317639 Jul 26 15:24 model.ckpt-110.meta

Besides, when I previously ran training in stand-alone mode (--variable_update replicated), the eval function worked well, so I don't know why it doesn't work in distributed_replicated mode. Can anyone help me? Thanks a lot.
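
(A minimal diagnostic sketch, not part of the original report: listing the variable names actually stored in the checkpoint makes the prefix mismatch visible. It assumes TensorFlow 1.x and reuses the checkpoint path from the directory listing above.)

# Diagnostic sketch (assumption: TF 1.x; path taken from the listing above).
# Print every variable name stored in the checkpoint so it can be compared
# with the names the eval graph tries to restore (e.g. 'v0/incept_v3_d0/...').
import tensorflow as tf

reader = tf.train.NewCheckpointReader("/home/sk/test/train_dir/model.ckpt-110")
for name in sorted(reader.get_variable_to_shape_map()):
    print(name)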

@hangzh2012

I had the same problem. How did you solve it?

@pan463194277
Author

@hangzh2012 I changed some code to adapt it. Here is what I found:
1. Different 'variable_update' modes save model parameters from different variable collections (local, global, trainable), but the 'eval' function only reads from the global collection.
2. Different 'variable_update' modes also store different variable names in the checkpoint. For example, 'replicated' mode saves num_gpus copies such as 'v0/xxx' and 'v1/xxx', while 'distributed_replicated' mode only saves variables whose names start with 'v/' (see the restore sketch after this comment).

Besides, you can read my commit here for reference: pan463194277@3e98be2
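
(A rough, hypothetical sketch of the rename-on-restore idea described in point 2 above; it is not the code from the linked commit. It assumes the checkpoint written by distributed_replicated stores variables under a single 'v/' prefix, while the eval graph names its tower-0 copies 'v0/...'.)

# Illustrative sketch only (assumptions noted above); TF 1.x API.
import tensorflow as tf

def build_restore_saver():
    # Map checkpoint names ('v/...') to the eval graph's tower-0 variables ('v0/...').
    var_map = {}
    for var in tf.global_variables():
        graph_name = var.op.name  # e.g. 'v0/incept_v3_d0/conv73/batchnorm73/gamma'
        if graph_name.startswith('v0/'):
            ckpt_name = 'v/' + graph_name[len('v0/'):]
            var_map[ckpt_name] = var
    return tf.train.Saver(var_map)

# Usage: saver = build_restore_saver()
#        saver.restore(sess, '/home/sk/test/train_dir/model.ckpt-110')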

@reedwm
Member

reedwm commented Jan 2, 2018

I believe we fixed this issue. @hangzh2012 can you try with the latest version of tf_cnn_benchmarks? Note --eval only works in non-distributed mode.

@hangzh2012

@reedwm Yes, it works fine when using --variable_update replicated.

@hangzh2012

@pan463194277 Thank you for your reply. It works well now. I was using the eval method in the wrong way (--variable_update parameter_server).

freedomtan pushed a commit to freedomtan/benchmarks that referenced this issue Apr 18, 2018
Merge internal changes into public repository (change 175579877)