🐛 Describe the bug
Hi,
I just started with DDP and am still in the process of learning the system. I am following the code and videos from the PyTorch examples at: PyTorch ddp Example
For the project I am doing, I want to connect two WSL (Ubuntu) instances running on two Windows machines (called LuanD and LINK) on the same LAN.
I have already set up WSL on both machines and configured the /etc/hosts files so they can resolve each other's IP addresses.
I also opened the firewall for port 2424 on the LuanD machine and forwarded it to the WSL IP address using: netsh interface portproxy add v4tov4 listenaddress=0.0.0.0 listenport=2424 connectaddress=WSL_IP connectport=2424
so that the rdzv backend can reach this port. I have already verified this, roughly as sketched below.
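To check connectivity outside of torchrun, I used roughly the following sketch (a hypothetical helper, check_port.py; "LuanD" and 2424 are the values from my setup):

# check_port.py - quick test that the rendezvous endpoint is reachable from a node
# (hypothetical helper script; host name and port are hard-coded from my setup)
import socket

HOST, PORT = "LuanD", 2424  # the --rdzv-endpoint passed to torchrun

with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    s.settimeout(5)
    try:
        s.connect((HOST, PORT))
        print(f"OK: {HOST}:{PORT} is reachable")
    except OSError as e:
        print(f"FAILED: cannot reach {HOST}:{PORT} ({e})")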
When I run these two commands on the two machines:
duongxuanluan@LuanD:/mnt/c/Users/AITrainer/Desktop/ddp-tutorial-series/ddp-tutorial-series$ torchrun --node_rank=0 --nnodes=2 --nproc-per-node=1 --rdzv-id=123 --rdzv-backend=c10d --rdzv-endpoint=LuanD:2424 multinode.py 10 5
duongxuanluan@LINK:/mnt/c/Users/duong/Desktop/ddp-tutorial-series/ddp-tutorial-series$ torchrun --node_rank=1 --nnodes=2 --nproc-per-node=1 --rdzv-id=123 --rdzv-backend=c10d --rdzv-endpoint=LuanD:2424 multinode.py 10 5
I got this error on the LuanD machine (rank 0):
[2023-11-16 15:04:27,330] torch.distributed.run: [WARNING] master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.
[W socket.cpp:663] [c10d] The client socket has failed to connect to [LINK]:49747 (errno: 110 - Connection timed out).
[W socket.cpp:663] [c10d] The client socket has failed to connect to LINK.:49747 (errno: 110 - Connection timed out).
[E socket.cpp:719] [c10d] The client socket has failed to connect to any network address of (LINK., 49747).
Traceback (most recent call last):
File "/mnt/c/Users/duong/Desktop/ddp-tutorial-series/multinode.py", line 112, in <module>
main(args.save_every, args.total_epochs, args.batch_size)
File "/mnt/c/Users/duong/Desktop/ddp-tutorial-series/multinode.py", line 96, in main
ddp_setup()
File "/mnt/c/Users/duong/Desktop/ddp-tutorial-series/multinode.py", line 14, in ddp_setup
init_process_group(backend="nccl")
File "/home/duongxuanluan/anaconda3/envs/DistributeTraining/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 74, in wrapper
func_return = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/duongxuanluan/anaconda3/envs/DistributeTraining/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 1141, in init_process_group
store, rank, world_size = next(rendezvous_iterator)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/duongxuanluan/anaconda3/envs/DistributeTraining/lib/python3.11/site-packages/torch/distributed/rendezvous.py", line 241, in _env_rendezvous_handler
store = _create_c10d_store(master_addr, master_port, rank, world_size, timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/duongxuanluan/anaconda3/envs/DistributeTraining/lib/python3.11/site-packages/torch/distributed/rendezvous.py", line 172, in _create_c10d_store
return TCPStore(
^^^^^^^^^
RuntimeError: The client socket has failed to connect to any network address of (LINK., 49747). The client socket has failed to connect to LINK.:49747 (errno: 110 - Connection timed out).
[2023-11-16 15:09:12,582] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 3827) of binary: /home/duongxuanluan/anaconda3/envs/DistributeTraining/bin/python
Traceback (most recent call last):
File "/home/duongxuanluan/anaconda3/envs/DistributeTraining/bin/torchrun", line 8, in <module>
sys.exit(main())
^^^^^^
File "/home/duongxuanluan/anaconda3/envs/DistributeTraining/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/home/duongxuanluan/anaconda3/envs/DistributeTraining/lib/python3.11/site-packages/torch/distributed/run.py", line 806, in main
run(args)
File "/home/duongxuanluan/anaconda3/envs/DistributeTraining/lib/python3.11/site-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/home/duongxuanluan/anaconda3/envs/DistributeTraining/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/duongxuanluan/anaconda3/envs/DistributeTraining/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
multinode.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-11-16_15:09:12
host : LuanD.
rank : 1 (local_rank: 0)
exitcode : 1 (pid: 3827)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
My guess is that nothing is open for the connection on the LINK machine. I tried to forward the port from LINK to WSL as well, but it looks like this port number changes every time I run; a small debugging sketch is below.
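To confirm which host and port the workers actually try to reach, one debugging sketch (not a fix) is to print MASTER_ADDR and MASTER_PORT at the top of ddp_setup() in multinode.py; init_process_group(backend="nccl") with the default env:// rendezvous reads these variables, which torchrun exports before launching the workers:

# debugging sketch: show the worker store address/port exported by torchrun
# before init_process_group (env:// rendezvous) tries to connect to it
import os
from torch.distributed import init_process_group

def ddp_setup():
    print("MASTER_ADDR =", os.environ.get("MASTER_ADDR"),
          "MASTER_PORT =", os.environ.get("MASTER_PORT"))
    init_process_group(backend="nccl")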
The error on the other machine is:
[2023-11-16 15:04:42,818] torch.distributed.run: [WARNING] master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.
[2023-11-16 15:09:18,364] torch.distributed.elastic.rendezvous.dynamic_rendezvous: [WARNING] The node 'LINK._7678_0' has failed to send a keep-alive heartbeat to the rendezvous '123' due to an error of type RendezvousConnectionError.
[2023-11-16 15:09:18,742] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 7687 closing signal SIGTERM
[2023-11-16 15:09:18,760] torch.distributed.elastic.rendezvous.dynamic_rendezvous: [WARNING] The node 'LINK._7678_0' has failed to shutdown the rendezvous '123' due to an error of type RendezvousConnectionError.
Traceback (most recent call last):
File "/home/duongxuanluan/.local/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 113, in _call_store
return getattr(self._store, store_op)(*args, **kwargs)
RuntimeError: Connection reset by peer
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/duongxuanluan/.local/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/home/duongxuanluan/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/home/duongxuanluan/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 806, in main
run(args)
File "/home/duongxuanluan/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/home/duongxuanluan/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/duongxuanluan/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 255, in launch_agent
result = agent.run()
File "/home/duongxuanluan/.local/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 124, in wrapper
result = f(*args, **kwargs)
File "/home/duongxuanluan/.local/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 736, in run
result = self._invoke_run(role)
File "/home/duongxuanluan/.local/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 909, in _invoke_run
num_nodes_waiting = rdzv_handler.num_nodes_waiting()
File "/home/duongxuanluan/.local/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 1083, in num_nodes_waiting
self._state_holder.sync()
File "/home/duongxuanluan/.local/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 409, in sync
get_response = self._backend.get_state()
File "/home/duongxuanluan/.local/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 73, in get_state
base64_state: bytes = self._call_store("get", self._key)
File "/home/duongxuanluan/.local/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 115, in _call_store
raise RendezvousConnectionError(
torch.distributed.elastic.rendezvous.api.RendezvousConnectionError: The connection to the C10d store has failed. See inner exception for details.
Does anyone happen to have the same problem, and is there any solution for it?
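In case it helps narrow this down, here is a minimal gloo-only sanity script (a hypothetical test_gloo.py, separate from the tutorial code) that can be launched with the same torchrun commands to rule out NCCL/GPU issues; it only does a single all_reduce over TCP:

# test_gloo.py - minimal connectivity sanity check (hypothetical helper, not part of the tutorial)
# launch with the same torchrun arguments, replacing multinode.py with this script
import torch
import torch.distributed as dist

def main():
    # torchrun exports RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT for env:// init
    dist.init_process_group(backend="gloo")
    rank = dist.get_rank()
    t = torch.ones(1) * rank
    dist.all_reduce(t, op=dist.ReduceOp.SUM)  # sum of ranks across all workers
    print(f"rank {rank}: all_reduce result = {t.item()}")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()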
Versions
PyTorch version: 2.1.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.2 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.35
Python version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.15.90.1-microsoft-standard-WSL2-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: Could not collect
Nvidia driver version: Could not collect
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 48 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Vendor ID: AuthenticAMD
Model name: AMD FX(tm)-9590 Eight-Core Processor
CPU family: 21
Model: 2
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
Stepping: 0
BogoMIPS: 9379.89
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 popcnt aes xsave avx f16c hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw xop fma4 topoext ssbd ibpb vmmcall bmi1 virt_ssbd arat
Hypervisor vendor: Microsoft
Virtualization type: full
L1d cache: 64 KiB (4 instances)
L1i cache: 256 KiB (4 instances)
L2 cache: 8 MiB (4 instances)
L3 cache: 8 MiB (1 instance)
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Retbleed: Mitigation; untrained return thunk; SMT vulnerable
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines, IBPB conditional, STIBP disabled, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Versions of relevant libraries:
[pip3] numpy==1.26.1
[pip3] torch==2.1.0
[pip3] torchaudio==2.1.0
[pip3] torchvision==0.16.0
[pip3] triton==2.1.0
[conda] Could not collect
cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu