Since we wrote the code below in the parameter server part, the parameter server cannot stop itself when training finishes unless we kill the process. Do you have any other suggestions?
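Presumably this refers to the standard blocking pattern in the ps branch of mnist_replica; a minimal sketch, with a hypothetical one-ps, two-worker cluster:

```python
import tensorflow as tf

# Hypothetical one-ps, two-worker cluster for illustration.
cluster = tf.train.ClusterSpec({
    "ps": ["localhost:2222"],
    "worker": ["localhost:2223", "localhost:2224"],
})
server = tf.train.Server(cluster, job_name="ps", task_index=0)

# server.join() blocks forever, so the ps process never exits on its own.
server.join()
```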
Yes, unfortunately this is a known issue. At the moment there's no way to automatically stop parameter servers when the job is done. I'll ping back on this issue when we have a better solution, but it's not high priority.
It is safe to kill the ps process after your training is done (and your checkpoint has been saved). Do you have a specific concern?
There's a work-around in tensorflow/tensorflow#4713 (comment)
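The workaround there replaces `server.join()` with a shared shutdown queue: each worker enqueues a token when it finishes, and each ps dequeues one token per worker and then exits. A minimal TF 1.x sketch (NUM_WORKERS, NUM_PS, and the helper names here are illustrative, not quoted verbatim from the linked comment):

```python
import tensorflow as tf

NUM_WORKERS = 2  # illustrative cluster size
NUM_PS = 1

def create_done_queue(ps_task_index):
    # shared_name lets every job in the cluster open the same queue.
    with tf.device("/job:ps/task:%d" % ps_task_index):
        return tf.FIFOQueue(NUM_WORKERS, tf.int32,
                            shared_name="done_queue%d" % ps_task_index)

def run_ps(server, task_index):
    # Replaces server.join(): block until every worker has signaled done.
    dequeue_op = create_done_queue(task_index).dequeue()
    with tf.Session(server.target) as sess:
        for _ in range(NUM_WORKERS):
            sess.run(dequeue_op)
    # All workers have signaled; returning lets the ps process exit normally.

def signal_done(sess):
    # Run by each worker after training finishes and the checkpoint
    # is saved: put one token in every ps task's queue.
    for i in range(NUM_PS):
        sess.run(create_done_queue(i).enqueue(1))
```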
Stopping the ps server gracefully is a requirement when running distributed training as a Kubernetes batch job. Can anyone write a detailed demo based on mnist_replica?
I wrote a demo for MNIST_data, and it seems to run OK. See dist_fifo.