Worker waiting for ps server #2

Open
shenqixiaojiang opened this issue May 27, 2018 · 6 comments


shenqixiaojiang commented May 27, 2018

Worker waiting for ps server - CreateSession still waiting for response from worker: /job:ps/replica:0/task:0
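Everything works on the original machine, but the error above appears after moving to a new machine. In other words, the worker never receives a response from the ps.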


shenqixiaojiang commented May 27, 2018

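Step 1: Test the ps/worker connection on localhost, i.e. with ps0 and worker0 on the same GPU machine. Still waiting.
Step 2: Set up passwordless SSH login from the new machine to itself, mainly by editing ~/.ssh/authorized_keys and ~/.ssh/known_hosts (the first login asks "yes or no"; the entry this adds to known_hosts can then be copied to the other machines). Still waiting.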

cat id_rsa.pub >> authorized_keys

Step 3: Make the pip list output identical on the new and old machines. Still waiting.
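
Before going further, it can help to confirm that the ps port is reachable at the TCP level at all, independent of TensorFlow. The following is a minimal sketch, not part of the original debugging; the host `localhost` and port `2222` in the usage comment are hypothetical and should be replaced with the ps address from your own cluster spec.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


# Example usage (hypothetical ps address):
#   can_connect("localhost", 2222)
```

If this returns True while the worker still waits, the port itself is open and the problem is more likely above the TCP layer, for example in how gRPC routes the connection.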


shenqixiaojiang commented May 27, 2018

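Step 4: Check whether the code is at fault. Running the simplest MNIST code on both the old and the new machine shows that it works on the old machine and waits on the new one. So the problem is not the code but the machine configuration.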


shenqixiaojiang commented May 27, 2018

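Step 5: Narrow it down to the network configuration. TensorFlow uses gRPC for communication, so I looked into gRPC.
gRPC ships with examples. Running the Python helloworld example failed with grpc._channel._Rendezvous: <_Rendezvous of RPC that terminated with (StatusCode.UNAVAILABLE, Connect Failed)>. Googling that error showed that once http_proxy is set, the no_proxy environment variable gets reset.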
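
To see what the process would actually inherit, the proxy-related variables can be printed from Python. This is a rough diagnostic sketch: `proxy_settings` and `is_bypassed` are hypothetical helper names, and the suffix matching in `is_bypassed` is a simplification that may not match gRPC's own no_proxy rules exactly.

```python
import os

PROXY_VARS = ("http_proxy", "https_proxy", "no_proxy")


def proxy_settings(environ=None):
    """Return the proxy-related variables that are set (lower and upper case)."""
    environ = os.environ if environ is None else environ
    found = {}
    for name in PROXY_VARS:
        for key in (name, name.upper()):
            if environ.get(key):
                found[key] = environ[key]
    return found


def is_bypassed(host, no_proxy):
    """Simplified check: does host match an entry of a comma-separated no_proxy list?"""
    entries = [e.strip().lstrip(".") for e in no_proxy.split(",") if e.strip()]
    return any(host == e or host.endswith("." + e) for e in entries)


# Example usage:
#   print(proxy_settings())
#   print(is_bypassed("localhost", os.environ.get("no_proxy", "")))
```

If `http_proxy` shows up and the ps host is not covered by `no_proxy`, the connection may be going to the proxy instead of the ps.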


shenqixiaojiang commented May 27, 2018

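Compromise solution: edit ~/.bashrc so that it no longer sets a proxy, i.e. remove export http_proxy=xx:11080.
With that, the new machine can run multi-machine, multi-GPU training just like the old one.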
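
If the proxy is still needed in the shell for other tools, a narrower variant might be to drop the proxy variables only inside the training process. The sketch below was not tested in this issue; `clear_proxy_env` is a hypothetical helper, and it assumes gRPC reads these variables from the process environment when channels are created, so it should run before importing tensorflow.

```python
import os


def clear_proxy_env(environ=None):
    """Remove http(s)_proxy variables (both cases) and return what was removed."""
    environ = os.environ if environ is None else environ
    removed = {}
    for name in ("http_proxy", "https_proxy"):
        for key in (name, name.upper()):
            if key in environ:
                removed[key] = environ.pop(key)
    return removed


# Example usage at the very top of the training script:
#   clear_proxy_env()
#   import tensorflow as tf
```

The returned dict makes it easy to log which settings were dropped, which helps when comparing the old and new machines.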


References for the gRPC communication problem: link1, link2


shenqixiaojiang commented Jun 7, 2018

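This mainly occurs when the ps and worker are on the same machine: if the ps and worker environments differ, the waiting problem also appears (one side waits for the other, rather than both waiting on each other). So check whether $LD_LIBRARY_PATH is the same for both.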

echo $LD_LIBRARY_PATH

You can log out and log back in to make sure the runtime environment is consistent.
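
When the two values are long, comparing them by eye is error-prone. A small sketch for diffing the values captured from the ps and worker shells follows; `split_paths` and `diff_paths` are hypothetical helper names, and the paths in the usage comment are made up for illustration.

```python
import os


def split_paths(value):
    """Split a PATH-style variable into its non-empty entries, preserving order."""
    return [p for p in value.split(os.pathsep) if p]


def diff_paths(ps_value, worker_value):
    """Return entries present on only one side: (only_in_ps, only_in_worker)."""
    ps_paths = split_paths(ps_value)
    worker_paths = split_paths(worker_value)
    only_ps = [p for p in ps_paths if p not in worker_paths]
    only_worker = [p for p in worker_paths if p not in ps_paths]
    return only_ps, only_worker


# Example usage (made-up values pasted from each shell):
#   diff_paths("/usr/local/cuda/lib64:/opt/lib", "/opt/lib")
```

Two empty lists mean both processes see the same set of library directories, although the order of entries can still differ.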
