key not found in checkpoint in distributed mode of tensorflow #40
Comments
I had the same problem, how did you solve it?
@hangzh2012 I changed some code to adapt it; there were a few questions along the way. Besides that, you can read my commit here for reference: pan463194277@3e98be2
I believe we fixed this issue. @hangzh2012, can you try with the latest version of tf_cnn_benchmarks?
@reedwm Yes, it worked fine when using `--variable_update=replicated`.
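For anyone who lands here with the same error, a minimal sketch of a replicated eval invocation; the model, batch size, and checkpoint directory below are placeholders, not the exact values used in this thread:

```bash
# Sketch: evaluate from the checkpoints written during training.
# Model, batch size, and train_dir are placeholders.
python tf_cnn_benchmarks.py \
  --eval=True \
  --model=resnet50 --batch_size=32 --num_gpus=1 \
  --variable_update=replicated \
  --train_dir=$HOME/test/train_dir
```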
@pan463194277 Thank you for your reply. It works well now; I was using the eval method the wrong way (`--variable_update=parameter_server`).
Merge internal changes into public repository (change 175579877)
When I run the training function of tf_cnn_benchmarks, everything looks fine and the checkpoint files are successfully stored in train_dir. But when I run the eval function, the "key not found in checkpoint" exception occurs.
My train worker script, parameter server script, and eval script (sketched below):
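A minimal sketch of what those three invocations can look like for a distributed_replicated run of tf_cnn_benchmarks; the hostnames, ports, model, batch size, and paths are placeholders rather than the values actually used in this report:

```bash
# Sketch only: a one-ps, one-worker distributed_replicated setup.
# Hosts, ports, model, batch size, and directories are placeholders.

# Parameter server
python tf_cnn_benchmarks.py \
  --job_name=ps --task_index=0 \
  --ps_hosts=ps0:2222 --worker_hosts=worker0:2223 \
  --variable_update=distributed_replicated

# Train worker (writes checkpoints into --train_dir)
python tf_cnn_benchmarks.py \
  --job_name=worker --task_index=0 \
  --ps_hosts=ps0:2222 --worker_hosts=worker0:2223 \
  --model=resnet50 --batch_size=32 --num_gpus=1 \
  --variable_update=distributed_replicated \
  --train_dir=$HOME/test/train_dir

# Eval (reads the checkpoint back from --train_dir).
# The --variable_update value passed here is what this issue hinged on;
# per the comments above, eval worked once it was run with replicated.
python tf_cnn_benchmarks.py \
  --eval=True \
  --model=resnet50 --batch_size=32 --num_gpus=1 \
  --variable_update=distributed_replicated \
  --train_dir=$HOME/test/train_dir
```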
ll ~/test/train_dir/
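For orientation, a checkpoint directory written by tf.train.Saver typically contains files along these lines (the step number is illustrative, not taken from this run):

```
checkpoint
model.ckpt-1000.data-00000-of-00001
model.ckpt-1000.index
model.ckpt-1000.meta
```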
Besides, I used to run the train method in stand-alone mode (`--variable_update=replicated`), and the eval function worked well, so I don't know why it doesn't work in distributed_replicated mode. Can anyone help me? Thanks a lot.
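A "key not found in checkpoint" error usually means the variable names stored in the checkpoint do not match the names the eval graph is trying to restore. One way to check is to list what the checkpoint actually contains, for example with TensorFlow's checkpoint inspection tool; the checkpoint path below is a placeholder for the model.ckpt-&lt;step&gt; prefix inside train_dir:

```bash
# Print the names, dtypes, and shapes of all tensors stored in a checkpoint.
# Point --file_name at the model.ckpt-<step> prefix (placeholder path below).
python -m tensorflow.python.tools.inspect_checkpoint \
  --file_name=$HOME/test/train_dir/model.ckpt-1000
```

Comparing that list with the variable names the eval graph expects makes it clear whether the variables were saved under a different prefix.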