Worker waiting for ps server #2
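Step 1: Test the ps/worker connection on localhost, i.e. with ps0 and worker0 on the same GPU machine. Still waiting.
Step 3: Make the `pip list` output identical on the new machine and the old machine. Still waiting.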
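A minimal sketch of how the package comparison in step 3 could be automated. It assumes each machine's package list was saved in `pip freeze` format (`name==version`), which is an assumption rather than what was actually done here; the helper names and the versions in the usage note are hypothetical.

```python
def parse_freeze(text):
    """Parse `pip freeze` style lines (name==version) into a dict."""
    packages = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, comments, and lines that are not plain pins.
        if not line or line.startswith("#") or "==" not in line:
            continue
        name, version = line.split("==", 1)
        packages[name.lower()] = version
    return packages


def diff_packages(old_text, new_text):
    """Return {name: (old_version, new_version)} for every mismatch.

    A version of None means the package is missing on that machine.
    """
    old, new = parse_freeze(old_text), parse_freeze(new_text)
    diffs = {}
    for name in sorted(set(old) | set(new)):
        if old.get(name) != new.get(name):
            diffs[name] = (old.get(name), new.get(name))
    return diffs
```

For example, `diff_packages(open("old.txt").read(), open("new.txt").read())` returns an empty dict once both machines match. In this issue the package lists were already aligned and the worker still waited, so an empty result only rules out this one cause.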
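Step 4: Check whether the code is at fault. I ran the simplest MNIST code on both the old machine and the new machine: the old machine works, the new machine keeps waiting. So the problem is not the code but the machine configuration.
Step 5: Narrow it down to the network configuration. TensorFlow communicates over gRPC, so I went and read up on how gRPC communication works.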
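One quick network-level check that fits step 5 is whether the ps port accepts a plain TCP connection at all. The sketch below uses only the standard library; `can_connect` is a hypothetical helper name, and the host and port to test are whatever the cluster spec uses.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

A raw socket does not go through `http_proxy`, so a `True` result here only shows that the port is open and reachable. It does not prove that gRPC itself can get through.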
Workaround (a compromise): edit ~/.bashrc so that it no longer sets a proxy, i.e. remove the line export http_proxy=xx:11080
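If the proxy has to stay in ~/.bashrc for other tools, a possible alternative is to strip it only from the environment of the TensorFlow processes. This is a sketch under assumptions: the list of variable names below is a guess at what a given gRPC build may read (only `http_proxy` is confirmed by this issue), and both function names are hypothetical.

```python
import os

# Assumed set of proxy variables; only http_proxy is known to matter here.
PROXY_VARS = ("http_proxy", "https_proxy", "grpc_proxy",
              "HTTP_PROXY", "HTTPS_PROXY")


def find_proxy_vars(env=None):
    """Return the proxy-related variables that are set and non-empty."""
    env = os.environ if env is None else env
    return {name: env[name] for name in PROXY_VARS if env.get(name)}


def env_without_proxy(env=None):
    """Return a copy of the environment with proxy variables removed."""
    cleaned = dict(os.environ if env is None else env)
    for name in PROXY_VARS:
        cleaned.pop(name, None)
    return cleaned
```

The cleaned mapping can be passed as `env=` to `subprocess.Popen` when launching the ps and worker processes, which leaves the login shell's proxy settings untouched.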
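This mainly shows up when ps and worker run on the same machine: if the ps and worker environments differ, the same waiting occurs (one side waits for the other, rather than both waiting on each other), so check whether $LD_LIBRARY_PATH is the same for both.
Logging out and logging back in makes sure both processes start from the same runtime environment.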
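The $LD_LIBRARY_PATH comparison can be done by eye, but a small helper makes differences easier to spot. This is only a sketch: the function names are hypothetical, and the two input strings are assumed to come from running `echo $LD_LIBRARY_PATH` in the shell that starts the ps and in the shell that starts the worker.

```python
def split_paths(value):
    """Split a colon-separated path list, dropping empty entries
    and trailing slashes so equivalent entries compare equal."""
    return [p.rstrip("/") or "/" for p in value.split(":") if p]


def compare_library_paths(ps_value, worker_value):
    """Report entries present on only one side, and whether the
    search order is identical."""
    ps, worker = split_paths(ps_value), split_paths(worker_value)
    return {
        "only_ps": [p for p in ps if p not in worker],
        "only_worker": [p for p in worker if p not in ps],
        "same_order": ps == worker,
    }
```

Order is reported separately because the dynamic linker searches the entries from left to right, so two environments with the same directories in a different order can still load different libraries.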
Worker waiting for ps server - CreateSession still waiting for response from worker: /job:ps/replica:0/task:0
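Everything works on the original machine, but after moving to the new machine the problem above appears. In other words, the worker never receives a response from the ps.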