how can the parameter server stop itself? #19

Open
YongCHN opened this Issue Nov 1, 2016 · 5 comments

YongCHN commented Nov 1, 2016

Since we write the following code in the parameter server part:

server.join()

the parameter server cannot stop itself when training finishes unless we kill the process. Do you have any other suggestions?
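For context, a typical between-graph replication script looks roughly like this (a minimal sketch; the cluster addresses and flag defaults are assumptions, not the actual code from this project):

```python
import tensorflow as tf

flags = tf.app.flags
flags.DEFINE_string("job_name", "ps", "Either 'ps' or 'worker'")
flags.DEFINE_integer("task_index", 0, "Index of the task within its job")
FLAGS = flags.FLAGS

# Hypothetical cluster layout, for illustration only.
cluster = tf.train.ClusterSpec({
    "ps": ["localhost:2222"],
    "worker": ["localhost:2223", "localhost:2224"],
})
server = tf.train.Server(cluster,
                         job_name=FLAGS.job_name,
                         task_index=FLAGS.task_index)

if FLAGS.job_name == "ps":
    # Blocks forever: the ps process never exits on its own,
    # even after every worker has finished training.
    server.join()
else:
    # ... build the replicated model and run the training loop ...
    pass
```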

jhseu (Member) commented Nov 1, 2016

Yes, unfortunately this is a known issue. There's currently no way to automatically stop parameter servers when the job is done. I'll ping back on this issue when we have a better solution, but it's not high priority at the moment.

yuefengz (Member) commented Nov 1, 2016

It is safe to kill the ps process after your training is done (and your checkpoint is saved as well). Do you have a specific concern?

yaroslavvb commented Jan 4, 2017

There's a work-around in tensorflow/tensorflow#4713 (comment)
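The gist of that workaround, as I understand it (a rough sketch; the helper names, queue names, and cluster sizes below are my own illustrative choices, not the exact code from the linked comment), is to replace server.join() on the ps with a blocking dequeue from a shared queue that each worker signals when it finishes:

```python
import tensorflow as tf

NUM_WORKERS = 2  # assumed cluster size, for illustration
NUM_PS = 1

def done_queue(ps_index):
    # A FIFOQueue pinned to the given ps task; shared_name lets every
    # process that builds it see the same underlying queue.
    with tf.device("/job:ps/task:%d" % ps_index):
        return tf.FIFOQueue(NUM_WORKERS, tf.int32,
                            shared_name="done_queue_%d" % ps_index)

def run_ps(server, ps_index):
    # Instead of server.join(): block until every worker has enqueued a
    # token, then return so the ps process can exit normally.
    queue = done_queue(ps_index)
    with tf.Session(server.target) as sess:
        for _ in range(NUM_WORKERS):
            sess.run(queue.dequeue())

def signal_done(server):
    # Each worker calls this once training (and checkpointing) is done:
    # push one token into every ps task's queue.
    enqueue_ops = [done_queue(i).enqueue(1) for i in range(NUM_PS)]
    with tf.Session(server.target) as sess:
        sess.run(enqueue_ops)
```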

hustcat commented Feb 17, 2017

Stopping the ps server gracefully is a requirement when running distributed training as a Kubernetes batch job. Can anyone write a detailed demo based on mnist_replica?

hustcat commented Feb 20, 2017

I wrote a demo for MNIST_data, and it seems to run OK. See dist_fifo.
