This repository has been archived by the owner on Jan 6, 2023. It is now read-only.
In my setup, I get "host not found" errors from TCPStore:
INFO 2020-01-17 00:55:01,694 Using TCPStore for c10d::Store implementation
INFO 2020-01-17 00:55:01,702 Rank 1 will conenct to TCPStore server at pytorch-elastic-test-z2b7s:47279
[ERROR] 2020-01-17 00:55:01,724 coordinator_p2p: Rank: -1
Error: Rank -1 received an Exception. Detailed message: host not found: Name or service not known
Creating the TCPStore with the IP address instead of the hostname fixes the issue for me:
$ git diff
diff --git a/torchelastic/rendezvous/etcd_rendezvous.py b/torchelastic/rendezvous/etcd_rendezvous.py
index 01215b6..219bff3 100644
--- a/torchelastic/rendezvous/etcd_rendezvous.py
+++ b/torchelastic/rendezvous/etcd_rendezvous.py
@@ -1074,7 +1074,7 @@ def setup_tcpstore(rank, world_size, rdzv_version, rdzv_impl):
     # FIXME: ideally, TCPStore should have an API that
     # accepts a pre-constructed socket.
     with closing(_get_socket_with_port()) as sock:
-        host = socket.gethostname()
+        host = socket.gethostbyname(socket.gethostname())
         port = sock.getsockname()[1]
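To make the idea in the diff concrete, here is a minimal standalone sketch of "advertise an IP address, not a hostname". The helper name `advertised_host` and its `resolve` parameter are hypothetical and not part of torchelastic. The fallback to the bare hostname is an assumption about what a reasonable behavior would be when local resolution fails.

```python
import socket


def advertised_host(hostname=None, resolve=socket.gethostbyname):
    """Return an address that peers can use to reach this machine.

    Hypothetical helper: resolves the local hostname to an IPv4 address,
    and falls back to the hostname itself if resolution fails.
    """
    name = hostname or socket.gethostname()
    try:
        # Same call as in the diff: hostname -> IPv4 address string.
        return resolve(name)
    except OSError:
        # socket.gaierror is a subclass of OSError; keep the old behavior.
        return name
```

One caveat: on some systems the local hostname is mapped to a loopback address (for example `127.0.1.1`) in `/etc/hosts`. In that case the resolved address would not be reachable from other workers, so it may be worth checking what `advertised_host()` returns on the rank 0 machine.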
Is there a reason to prefer the hostname here, or is using the IP address always OK?
Or, now that PyTorch 1.4 has been released, should we just switch to EtcdStore?
I'm curious about your hostname setup. It looks like worker 1 (rank 1) can't resolve rank 0's hostname (pytorch-elastic-test-z2b7s) through DNS. Can you try the following?
On the machine running rank 1, run nslookup pytorch-elastic-test-z2b7s. If the name does not resolve, the hosts are probably set up with "private" hostnames that have no entry in DNS (or in the local hosts file).
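If nslookup is not available on the worker, roughly the same check can be done from Python. This is a sketch only, and `resolve_addresses` is a hypothetical helper name. Unlike nslookup, `socket.getaddrinfo` goes through the system resolver, so it also consults the local hosts file.

```python
import socket


def resolve_addresses(host, port=None):
    """Return the sorted addresses that `host` resolves to, or [] on failure."""
    try:
        infos = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        # Name could not be resolved from this machine.
        return []
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the address string for both IPv4 and IPv6.
    return sorted({info[4][0] for info in infos})
```

Running `resolve_addresses("pytorch-elastic-test-z2b7s")` on the rank 1 machine should return an empty list if the name cannot be resolved there, which would match the "host not found" error in the log.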