This repository was archived by the owner on Jan 6, 2023. It is now read-only.
rendezvous: _matches_machine_hostname doesn't resolve hostnames fully #165
Open
Description
🐛 Bug
Component (check all that applies):
- state api
- train_step api
- train_loop
- rendezvous
- checkpoint
- rollback
- metrics
- petctl
- examples
- docker
- other
To Reproduce
Steps to reproduce the behavior:
- Launch a 2 node job on Kubernetes+Volcano
LOGLEVEL=INFO python -m torch.distributed.run --rdzv_backend c10d --rdzv_id 1 --rdzv_endpoint "$VC_SH_0_HOSTS" --nnodes 2 echo hello
- The rendezvous times out because the rank 0 host doesn't realize it's the master, due to insufficient hostname resolution
root@sh-db2kkt73p534vd-sh-0-0:/app# echo $VC_SH_0_HOSTS
sh-db2kkt73p534vd-sh-0-0.sh-db2kkt73p534vd
root@sh-db2kkt73p534vd-sh-0-0:/app# hostname
sh-db2kkt73p534vd-sh-0-0
root@sh-db2kkt73p534vd-sh-0-0:/app# cat /etc/resolv.conf
nameserver 10.100.0.10
search default.svc.cluster.local svc.cluster.local cluster.local us-west-2.compute.internal
options ndots:5
root@sh-db2kkt73p534vd-sh-0-0:/app# cat /etc/hosts
# Kubernetes-managed hosts file.
127.0.0.1 localhost
::1 localhost ip6-localhost ip6-loopback
fe00::0 ip6-localnet
fe00::0 ip6-mcastprefix
fe00::1 ip6-allnodes
fe00::2 ip6-allrouters
192.168.15.246 sh-db2kkt73p534vd-sh-0-0.sh-db2kkt73p534vd.default.svc.cluster.local sh-db2kkt73p534vd-sh-0-0
The hostname is sh-db2kkt73p534vd-sh-0-0, but Volcano gives the address sh-db2kkt73p534vd-sh-0-0.sh-db2kkt73p534vd. Between /etc/hosts, resolv.conf, and the hostname there's all the information required to realize that these addresses are equivalent, but the current logic isn't sufficient.
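To illustrate why that information is sufficient, here is a rough sketch that uses only the values pasted above. It assumes a deliberately simplified model in which a short name is qualified with each `search` domain and then compared against the names on the `/etc/hosts` line. Real resolvers differ in detail, and the helper names are hypothetical (they are not part of torchelastic).

```python
# Values copied from the resolv.conf and /etc/hosts output in this issue.
SEARCH_DOMAINS = (
    "default.svc.cluster.local svc.cluster.local "
    "cluster.local us-west-2.compute.internal"
).split()
HOSTS_LINE = (
    "192.168.15.246 "
    "sh-db2kkt73p534vd-sh-0-0.sh-db2kkt73p534vd.default.svc.cluster.local "
    "sh-db2kkt73p534vd-sh-0-0"
)


def expand_with_search_domains(name, search_domains):
    """Return the name as given plus the name qualified with each search domain."""
    return [name] + [f"{name}.{domain}" for domain in search_domains]


def names_on_hosts_line(line):
    """Return the host names on one /etc/hosts line (the first field is the IP)."""
    fields = line.split("#", 1)[0].split()
    return set(fields[1:])


def refers_to_same_host(address, hostname, hosts_line, search_domains):
    """True if `hostname` and some expansion of `address` share a hosts entry."""
    names = names_on_hosts_line(hosts_line)
    if hostname not in names:
        return False
    candidates = expand_with_search_domains(address, search_domains)
    return any(candidate in names for candidate in candidates)
```

With the values from this report, `refers_to_same_host("sh-db2kkt73p534vd-sh-0-0.sh-db2kkt73p534vd", "sh-db2kkt73p534vd-sh-0-0", HOSTS_LINE, SEARCH_DOMAINS)` is true, because the first search domain turns the Volcano address into the fully qualified name listed in `/etc/hosts`.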
We may want to do a full DNS resolution on the address and check whether it matches any of the local IP addresses.
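A minimal sketch of that idea, using only the standard `socket` module. The function names are hypothetical and this is not the existing `_matches_machine_hostname` code. It assumes the machine's own host name resolves to its pod IP (as the `/etc/hosts` entry above suggests) rather than enumerating every network interface, which would need platform-specific calls or a third-party library.

```python
import socket


def resolve_ips(host):
    """Return every IP address `host` resolves to, or an empty set on failure."""
    try:
        infos = socket.getaddrinfo(host, None, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        return set()
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return {info[4][0] for info in infos}


def local_ips():
    """Return loopback addresses plus whatever the local host name resolves to."""
    ips = {"127.0.0.1", "::1"}
    ips |= resolve_ips(socket.gethostname())
    return ips


def is_local_address(host):
    """True if any IP that `host` resolves to is also one of this machine's IPs."""
    return bool(resolve_ips(host) & local_ips())
```

On the rank 0 pod, a check such as `is_local_address("sh-db2kkt73p534vd-sh-0-0.sh-db2kkt73p534vd")` would be expected to succeed if the cluster DNS resolves the Volcano address to the same pod IP that the host name maps to. I have not verified this in the cluster.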
Expected behavior
torchelastic realizes that the rendezvous endpoint refers to the current node and starts the c10d server.
Environment
- torchelastic version (e.g. 0.1.0rc1):
- OS (e.g., Linux): Linux sh-db2kkt73p534vd-sh-0-0 4.14.241-184.433.amzn2.x86_64 #1 SMP Wed Aug 4 14:35:15 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
- How you installed torchelastic (conda, pip, source, docker): docker
- Docker image and tag (if using docker): https://github.com/pytorch/torchx/pkgs/container/torchx/15644476?tag=0.1.2dev0
- Build command you used (if compiling from source):
- Git commit (if installed from source):
- Python version: 3.7.11
- CUDA/cuDNN version:
- GPU models and configuration:
- Execution environment (on-prem, aws, etc): EKS + Volcano
- Any other relevant information:
Additional context