DDP multi-host with single GPU each #78047
I think the id lookups are really messed up somewhere. Traceback (most recent call last): …
Hi, as the first error log indicates, this call in user code … This could happen when …
Folks, so if you have two nodes and each node has only one GPU, then based on the logic, node A is rank 0 (the master). I managed to make it work, but what is strange is that I had to set all the environment variables and all the values explicitly, including os.environ["NCCL_DEBUG"] = "INFO".

If you check the examples, sometimes the rank is passed to .to(device) and sometimes the rank is passed in device_ids. I think it makes more sense to be concrete in the typing, i.e. rank 0, local rank 0, CUDA device id 0. For example, torch already has an abstract device, so it makes sense to always use that. If you think about it logically, the current situation is very strange; in essence, it makes much more sense to have some sort of priority instead.

Imagine you have 4 servers with 1 GPU each, so they all have local rank 0 and GPU device id 0. ;-)) Right now device_ids, the CUDA device id, and local_rank are IMHO very ambiguous, and any of them can be passed …
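For context, here is a minimal sketch of the environment setup described above, assuming a two-node job with one GPU per node. The address and port are placeholders; only NCCL_DEBUG=INFO comes from the comment itself.

```python
import os

# Assumed two-node, one-GPU-per-node setup. MASTER_ADDR/MASTER_PORT are
# placeholders for the rank-0 host's reachable address and a free port.
os.environ["MASTER_ADDR"] = "10.0.0.1"   # assumption: master node address
os.environ["MASTER_PORT"] = "29500"      # assumption: rendezvous port
os.environ["WORLD_SIZE"] = "2"           # total number of processes
os.environ["RANK"] = "0"                 # global rank: 0 on the master, 1 on the worker
os.environ["LOCAL_RANK"] = "0"           # each host has one GPU, so local rank is always 0
os.environ["NCCL_DEBUG"] = "INFO"        # from the comment above: surface NCCL logs
```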
🐛 Describe the bug
Folks,
I have two hosts, and each host has a single GPU. I'm using this example:
https://github.com/sudomaze/ttorch/blob/main/examples/ddp/run.py
If I use, on the master node:
rank 0, world_size 2
and on the worker:
rank 1, world_size 2

If I use:
rank 0, world_size 1
rank 1, world_size 1
the master starts the training loop but the worker never connects.
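For reference, a minimal sketch of case one (two processes, world_size 2, one GPU per host). The model, master address, and port are assumptions, not taken from the linked example; the key point is that with one GPU per host the local CUDA device is always 0 even though the global ranks differ, so device_ids should name the local device, not the global rank.

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # Global rank and world size come from the environment; run this
    # once per host with RANK=0 on the master and RANK=1 on the worker.
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])  # 2 for the two-host case

    os.environ.setdefault("MASTER_ADDR", "10.0.0.1")  # assumption: master IP
    os.environ.setdefault("MASTER_PORT", "29500")     # assumption: free port

    dist.init_process_group("nccl", rank=rank, world_size=world_size)

    # Each host has exactly one GPU, so the local device is cuda:0
    # on every host regardless of the global rank.
    device = torch.device("cuda", 0)
    model = nn.Linear(8, 8).to(device)
    ddp_model = DDP(model, device_ids=[0])  # local device id, not the global rank

    # One toy training step to exercise the gradient allreduce.
    opt = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
    out = ddp_model(torch.randn(4, 8, device=device))
    out.sum().backward()
    opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```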
Stack trace for case one (note that the master only starts and waits for the worker when world_size is 2):

Master node:

I managed to narrow it down a bit; this is the case.
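A side note on the waiting behavior: init_process_group blocks until all world_size ranks have joined the rendezvous, which matches the master hanging when world_size is 2 and the worker never arrives. A minimal sketch with assumed values follows; note that for the NCCL backend the timeout is only honored when blocking wait or async error handling is enabled.

```python
import os
from datetime import timedelta
import torch.distributed as dist

# Fail fast instead of hanging forever if a peer never joins. For NCCL,
# the timeout is enforced only with NCCL_BLOCKING_WAIT=1 (or
# NCCL_ASYNC_ERROR_HANDLING=1) set in the environment.
os.environ.setdefault("NCCL_BLOCKING_WAIT", "1")

dist.init_process_group(
    backend="nccl",
    rank=int(os.environ["RANK"]),
    world_size=2,                  # the master blocks here waiting for rank 1
    timeout=timedelta(minutes=5),  # assumed value
)
```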
Versions
Collecting environment information...
PyTorch version: 1.11.0+cu115
Is debug build: False
CUDA used to build PyTorch: 11.5
ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04 LTS (x86_64)
GCC version: (Ubuntu 11.2.0-19ubuntu1) 11.2.0
Clang version: Could not collect
CMake version: version 3.22.1
Libc version: glibc-2.35
Python version: 3.10.4 (main, Apr 2 2022, 09:04:19) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.10.102.1-microsoft-standard-WSL2-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 3090
Nvidia driver version: 512.15
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
Versions of relevant libraries:
[pip3] numpy==1.21.6
[pip3] torch==1.11.0+cu115
[pip3] torchaudio==0.11.0+cu115
[pip3] torchvision==0.12.0+cu115
[conda] Could not collect
cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @SciPioneer @H-Huang @kwen2501