🐛 Describe the bug
Hi,
I just started with DDP and am still in the process of learning the system. I am following the code and videos from the PyTorch examples at: PyTorch ddp Example
For the project I am doing, I want to connect two WSL (Ubuntu) instances running on two Windows machines (called LuanD and LINK) on the same LAN.
I have already set up WSL on both machines and configured the /etc/hosts files so they can resolve each other's IP addresses.
I also opened the firewall for port 2424 on the LuanD machine and forwarded it to the WSL IP address using: netsh interface portproxy add v4tov4 listenaddress=0.0.0.0 listenport=2424 connectaddress=WSL_IP connectport=2424
so that the rdzv backend can reach this port. I have already verified this, roughly as sketched below.
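To check connectivity outside of torchrun, I used roughly the following sketch (a hypothetical helper, check_port.py; "LuanD" and 2424 are the values from my setup):

# check_port.py - quick test that the rendezvous endpoint is reachable from a node
# (hypothetical helper script; host name and port are hard-coded from my setup)
import socket

HOST, PORT = "LuanD", 2424  # the --rdzv-endpoint passed to torchrun

with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    s.settimeout(5)
    try:
        s.connect((HOST, PORT))
        print(f"OK: {HOST}:{PORT} is reachable")
    except OSError as e:
        print(f"FAILED: cannot reach {HOST}:{PORT} ({e})")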
When I run these two commands on the two machines:
duongxuanluan@LuanD:/mnt/c/Users/AITrainer/Desktop/ddp-tutorial-series/ddp-tutorial-series$ torchrun --node_rank=0 --nnodes=2 --nproc-per-node=1 --rdzv-id=123 --rdzv-backend=c10d --rdzv-endpoint=LuanD:2424 multinode.py 10 5
duongxuanluan@LINK:/mnt/c/Users/duong/Desktop/ddp-tutorial-series/ddp-tutorial-series$ torchrun --node_rank=1 --nnodes=2 --nproc-per-node=1 --rdzv-id=123 --rdzv-backend=c10d --rdzv-endpoint=LuanD:2424 multinode.py 10 5
I got this error on the LuanD machine (rank 0):
[2023-11-16 15:04:27,330] torch.distributed.run: [WARNING] master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.
[W socket.cpp:663] [c10d] The client socket has failed to connect to [LINK]:49747 (errno: 110 - Connection timed out).
[W socket.cpp:663] [c10d] The client socket has failed to connect to LINK.:49747 (errno: 110 - Connection timed out).
[E socket.cpp:719] [c10d] The client socket has failed to connect to any network address of (LINK., 49747).
Traceback (most recent call last):
File "/mnt/c/Users/duong/Desktop/ddp-tutorial-series/multinode.py", line 112, in <module>
main(args.save_every, args.total_epochs, args.batch_size)
File "/mnt/c/Users/duong/Desktop/ddp-tutorial-series/multinode.py", line 96, in main
ddp_setup()
File "/mnt/c/Users/duong/Desktop/ddp-tutorial-series/multinode.py", line 14, in ddp_setup
init_process_group(backend="nccl")
File "/home/duongxuanluan/anaconda3/envs/DistributeTraining/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 74, in wrapper
func_return = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/duongxuanluan/anaconda3/envs/DistributeTraining/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 1141, in init_process_group
store, rank, world_size = next(rendezvous_iterator)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/duongxuanluan/anaconda3/envs/DistributeTraining/lib/python3.11/site-packages/torch/distributed/rendezvous.py", line 241, in _env_rendezvous_handler
store = _create_c10d_store(master_addr, master_port, rank, world_size, timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/duongxuanluan/anaconda3/envs/DistributeTraining/lib/python3.11/site-packages/torch/distributed/rendezvous.py", line 172, in _create_c10d_store
return TCPStore(
^^^^^^^^^
RuntimeError: The client socket has failed to connect to any network address of (LINK., 49747). The client socket has failed to connect to LINK.:49747 (errno: 110 - Connection timed out).
[2023-11-16 15:09:12,582] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 3827) of binary: /home/duongxuanluan/anaconda3/envs/DistributeTraining/bin/python
Traceback (most recent call last):
File "/home/duongxuanluan/anaconda3/envs/DistributeTraining/bin/torchrun", line 8, in <module>
sys.exit(main())
^^^^^^
File "/home/duongxuanluan/anaconda3/envs/DistributeTraining/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/home/duongxuanluan/anaconda3/envs/DistributeTraining/lib/python3.11/site-packages/torch/distributed/run.py", line 806, in main
run(args)
File "/home/duongxuanluan/anaconda3/envs/DistributeTraining/lib/python3.11/site-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/home/duongxuanluan/anaconda3/envs/DistributeTraining/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/duongxuanluan/anaconda3/envs/DistributeTraining/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
multinode.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-11-16_15:09:12
host : LuanD.
rank : 1 (local_rank: 0)
exitcode : 1 (pid: 3827)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
My guess is that nothing is open for the connection on the LINK machine. I tried to forward the port from LINK to WSL as well, but it looks like this port number changes every time I run; a small debugging sketch is below.
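To confirm which host and port the workers actually try to reach, one debugging sketch (not a fix) is to print MASTER_ADDR and MASTER_PORT at the top of ddp_setup() in multinode.py; init_process_group(backend="nccl") with the default env:// rendezvous reads these variables, which torchrun exports before launching the workers:

# debugging sketch: show the worker store address/port exported by torchrun
# before init_process_group (env:// rendezvous) tries to connect to it
import os
from torch.distributed import init_process_group

def ddp_setup():
    print("MASTER_ADDR =", os.environ.get("MASTER_ADDR"),
          "MASTER_PORT =", os.environ.get("MASTER_PORT"))
    init_process_group(backend="nccl")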
The error on the other machine is:
[2023-11-16 15:04:42,818] torch.distributed.run: [WARNING] master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.
[2023-11-16 15:09:18,364] torch.distributed.elastic.rendezvous.dynamic_rendezvous: [WARNING] The node 'LINK._7678_0' has failed to send a keep-alive heartbeat to the rendezvous '123' due to an error of type RendezvousConnectionError.
[2023-11-16 15:09:18,742] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 7687 closing signal SIGTERM
[2023-11-16 15:09:18,760] torch.distributed.elastic.rendezvous.dynamic_rendezvous: [WARNING] The node 'LINK._7678_0' has failed to shutdown the rendezvous '123' due to an error of type RendezvousConnectionError.
Traceback (most recent call last):
File "/home/duongxuanluan/.local/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 113, in _call_store
return getattr(self._store, store_op)(*args, **kwargs)
RuntimeError: Connection reset by peer
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/duongxuanluan/.local/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/home/duongxuanluan/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/home/duongxuanluan/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 806, in main
run(args)
File "/home/duongxuanluan/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/home/duongxuanluan/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/duongxuanluan/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 255, in launch_agent
result = agent.run()
File "/home/duongxuanluan/.local/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 124, in wrapper
result = f(*args, **kwargs)
File "/home/duongxuanluan/.local/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 736, in run
result = self._invoke_run(role)
File "/home/duongxuanluan/.local/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 909, in _invoke_run
num_nodes_waiting = rdzv_handler.num_nodes_waiting()
File "/home/duongxuanluan/.local/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 1083, in num_nodes_waiting
self._state_holder.sync()
File "/home/duongxuanluan/.local/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 409, in sync
get_response = self._backend.get_state()
File "/home/duongxuanluan/.local/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 73, in get_state
base64_state: bytes = self._call_store("get", self._key)
File "/home/duongxuanluan/.local/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 115, in _call_store
raise RendezvousConnectionError(
torch.distributed.elastic.rendezvous.api.RendezvousConnectionError: The connection to the C10d store has failed. See inner exception for details.
Does anyone happen to have the same problem, and is there any solution for it?
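In case it helps narrow this down, here is a minimal gloo-only sanity script (a hypothetical test_gloo.py, separate from the tutorial code) that can be launched with the same torchrun commands to rule out NCCL/GPU issues; it only does a single all_reduce over TCP:

# test_gloo.py - minimal connectivity sanity check (hypothetical helper, not part of the tutorial)
# launch with the same torchrun arguments, replacing multinode.py with this script
import torch
import torch.distributed as dist

def main():
    # torchrun exports RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT for env:// init
    dist.init_process_group(backend="gloo")
    rank = dist.get_rank()
    t = torch.ones(1) * rank
    dist.all_reduce(t, op=dist.ReduceOp.SUM)  # sum of ranks across all workers
    print(f"rank {rank}: all_reduce result = {t.item()}")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()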
Versions
PyTorch version: 2.1.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.2 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.35
Python version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.15.90.1-microsoft-standard-WSL2-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: Could not collect
Nvidia driver version: Could not collect
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 48 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Vendor ID: AuthenticAMD
Model name: AMD FX(tm)-9590 Eight-Core Processor
CPU family: 21
Model: 2
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
Stepping: 0
BogoMIPS: 9379.89
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 popcnt aes xsave avx f16c hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw xop fma4 topoext ssbd ibpb vmmcall bmi1 virt_ssbd arat
Hypervisor vendor: Microsoft
Virtualization type: full
L1d cache: 64 KiB (4 instances)
L1i cache: 256 KiB (4 instances)
L2 cache: 8 MiB (4 instances)
L3 cache: 8 MiB (1 instance)
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Retbleed: Mitigation; untrained return thunk; SMT vulnerable
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines, IBPB conditional, STIBP disabled, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Versions of relevant libraries:
[pip3] numpy==1.26.1
[pip3] torch==2.1.0
[pip3] torchaudio==2.1.0
[pip3] torchvision==0.16.0
[pip3] triton==2.1.0
[conda] Could not collect
cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu