This repository has been archived by the owner on Jan 6, 2023. It is now read-only.

Using IP address instead of name for TCPStore creation #31

Closed
kit1980 opened this issue Jan 17, 2020 · 2 comments

Comments

@kit1980
Member

kit1980 commented Jan 17, 2020

In my setup, I get "host not found" errors for TCPStore:

INFO 2020-01-17 00:55:01,694 Using TCPStore for c10d::Store implementation
INFO 2020-01-17 00:55:01,702 Rank 1 will conenct to TCPStore server at pytorch-elastic-test-z2b7s:47279
[ERROR] 2020-01-17 00:55:01,724 coordinator_p2p: Rank: -1
Error: Rank -1 received an Exception. Detailed message: host not found: Name or service not known

Using the IP address instead of the hostname when creating the TCPStore fixes the issue for me.

$ git diff
diff --git a/torchelastic/rendezvous/etcd_rendezvous.py b/torchelastic/rendezvous/etcd_rendezvous.py
index 01215b6..219bff3 100644
--- a/torchelastic/rendezvous/etcd_rendezvous.py
+++ b/torchelastic/rendezvous/etcd_rendezvous.py
@@ -1074,7 +1074,7 @@ def setup_tcpstore(rank, world_size, rdzv_version, rdzv_impl):
         # FIXME: ideally, TCPStore should have an API that
         # accepts a pre-constructed socket.
         with closing(_get_socket_with_port()) as sock:
-            host = socket.gethostname()
+            host = socket.gethostbyname(socket.gethostname())
             port = sock.getsockname()[1]

Is there a reason to prefer the hostname, or is using the IP address always OK?

Or, now that PyTorch 1.4 has been released, should we just switch to EtcdStore?

@kiukchung
Contributor

Hey Sergii,

Yep, moving to EtcdStore is the plan. There are a few validations that I'm running on this before we remove this hack:

https://github.com/pytorch/elastic/blob/master/torchelastic/rendezvous/etcd_rendezvous.py#L108

I'm curious about your hostname setup. It looks like worker 1 (rank 1) can't resolve rank 0's hostname (pytorch-elastic-test-z2b7s) via DNS. Can you try:

  • If running in a Docker container: pass --net=host to docker run. Here are the full docker run flags that I've validated on AWS:
    https://github.com/pytorch/elastic/blob/master/aws/config/user_data_worker#L55

  • On the machine running rank 1, run nslookup pytorch-elastic-test-z2b7s. If this does not resolve, then the hosts must be set up with a "private" hostname with no entry in the local route table.
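If nslookup is not available on the worker (for example inside a minimal container image), roughly the same check can be made from Python. This is only a sketch: `resolve_addresses` is a hypothetical helper name, not part of torchelastic, and it goes through the system resolver (which also consults /etc/hosts), so its answer can differ from a pure DNS lookup.

```python
import socket


def resolve_addresses(hostname):
    """Return the sorted IP addresses the local resolver gives for hostname.

    An empty list means the name could not be resolved from this machine,
    which is the situation behind "Name or service not known".
    """
    try:
        infos = socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the address string for both IPv4 and IPv6.
    return sorted({info[4][0] for info in infos})


if __name__ == "__main__":
    # Hostname taken from the log in this issue; run this on the rank 1 machine.
    name = "pytorch-elastic-test-z2b7s"
    addresses = resolve_addresses(name)
    if addresses:
        print(f"{name} resolves to {addresses}")
    else:
        print(f"{name} does not resolve from this machine")
```

An empty result on the rank 1 machine would confirm that the problem is name resolution rather than the TCPStore itself.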

The Python docs for the socket API (https://docs.python.org/2/library/socket.html) state:

Note: gethostname() doesn’t always return the fully qualified domain name; use getfqdn() (see above).

It's probably a good idea to change host = socket.gethostname() to host = socket.getfqdn()
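To compare the options discussed in this thread on a given machine, a small script along these lines could print what each call would advertise as the TCPStore host. It is a sketch only: `host_candidates` is a hypothetical helper, and it says nothing about whether other ranks can actually reach any of these values.

```python
import socket


def host_candidates():
    """Collect the host value each approach from this thread would produce."""
    short_name = socket.gethostname()  # what the original code uses
    fqdn = socket.getfqdn()            # the getfqdn() suggestion
    try:
        # The patch from the issue: resolve the local name to an IPv4 address.
        ip_address = socket.gethostbyname(short_name)
    except OSError:
        # socket.gaierror is a subclass of OSError; the local name
        # does not resolve even on this machine.
        ip_address = None
    return {
        "gethostname": short_name,
        "getfqdn": fqdn,
        "gethostbyname": ip_address,
    }


if __name__ == "__main__":
    for source, value in host_candidates().items():
        print(f"{source:>14}: {value}")
```

One thing to watch for: on some systems the local hostname maps to a loopback address such as 127.0.0.1 or 127.0.1.1 in /etc/hosts, in which case the gethostbyname value would not be usable by other machines either.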

@kiukchung
Contributor

FWIW, I've switched over to EtcdStore in this PR: #34
so that I can also close #11.

I'll close this issue when PR-34 gets merged to trunk, unless you have additional comments/concerns. Thanks!
