Failed to create Gloo new group after initialized with NCCL #68726
Hmm, I am not able to reproduce the issue using the latest PyTorch nightly. I ran this script: https://github.com/rohan-varma/torch-script/blob/master/training_script.py on a GPU cluster via test.sh (https://github.com/rohan-varma/torch-script/blob/master/test.sh), and the output was -
A significant overhaul of the TCPStore landed recently (#68226) which improves the error logging. Could you try the latest/nightly PyTorch so we can get more error details?
In addition, new_group uses the same default store as the default process group, so it is quite strange that it would attempt to connect to the wrong address. Can you print out $MASTER_ADDR and $MASTER_PORT in both of your worker scripts?
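A quick way to do the check suggested above is a tiny helper each worker can call before initializing the process group. This is a sketch, not part of the original scripts; `MASTER_ADDR` and `MASTER_PORT` are the standard `torch.distributed` rendezvous environment variables.

```python
import os

def rendezvous_target() -> str:
    """Return the MASTER_ADDR:MASTER_PORT pair this worker will rendezvous on."""
    addr = os.environ.get("MASTER_ADDR", "<unset>")
    port = os.environ.get("MASTER_PORT", "<unset>")
    return f"{addr}:{port}"

# Print this on every worker; all workers should agree, and the address
# should NOT resolve to 127.0.0.1 on a multi-machine run.
print("rendezvous target:", rendezvous_target())
```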
@rohan-varma Thank you for your reply! I've double-checked the cluster and found that the real error is that I could not establish the Gloo comm group. The reason is that the hostnames of all the machines were set to the same name, which resolved to 127.0.0.1, so Gloo resolved the hostname to localhost and connected to itself... I've manually changed the hostname and it works now :) Thank you for your help! And I wonder if there is a way to pass the IP directly to Gloo?
I found if we set
This is happening to me too. If I want DEBUG info, is there a way to avoid this? Has this issue been diagnosed?
This is not currently possible.
@zhuzilin How did you manage to change the hostname so it works? Did you ever find a way to pass an IP directly to gloo? |
@trias702 You can try manually setting the network interface as follows: first, use
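The comment above is truncated, but one documented way to pin the network interface is the `GLOO_SOCKET_IFNAME` / `NCCL_SOCKET_IFNAME` environment variables, which bypass hostname resolution entirely. This is a hedged sketch: `eth0` is a placeholder, so substitute the interface that carries your cluster network (check with `ip addr` or `ifconfig`).

```python
import os

# Force Gloo and NCCL to bind to a specific interface instead of resolving
# the hostname (which on the reported cluster mapped to 127.0.0.1).
# "eth0" is a placeholder; pick your actual cluster-facing interface.
os.environ["GLOO_SOCKET_IFNAME"] = "eth0"
os.environ["NCCL_SOCKET_IFNAME"] = "eth0"

# These must be set in every worker BEFORE calling
# torch.distributed.init_process_group() or dist.new_group(backend="gloo").
```

Exporting the same variables in the launch shell script works equally well.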
This is an unfortunate bug. The only time you want to set
🐛 Bug
In our project Tencent/PatrickStar, we need to create an NCCL comm group and a Gloo comm group in order to use both GPU reduce-scatter and CPU comm operations. However, it seems that after initializing with NCCL, the Gloo group does not detect the master address and master port, and instead uses localhost (127.0.0.1).
I suspect the reason is that the NCCL store and the Gloo store are not compatible with each other, so the new Gloo group cannot read the master address saved by the NCCL group.
The relevant error message is:
To Reproduce
Run:
Note that the code works when run on a single machine, or when the second new group is also created with the "nccl" backend.
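The reproduction script itself is not shown above; a minimal sketch of the pattern described in the report (default group on NCCL, second group on Gloo) would look roughly like this. Launch it with `torchrun` across two machines; the `all_reduce` over the Gloo group is where the wrong-address connection shows up on the affected cluster.

```python
# Hedged reproduction sketch, not the author's exact script.
# Launch with e.g.:
#   torchrun --nnodes=2 --nproc_per_node=1 \
#       --master_addr=<node0-ip> --master_port=29500 repro.py
import os

import torch
import torch.distributed as dist

def main() -> None:
    # Default process group over NCCL for GPU collectives.
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", "0")))

    # Second group over Gloo for CPU collectives. On the reported cluster
    # this fails because Gloo resolves the local hostname to 127.0.0.1.
    cpu_group = dist.new_group(backend="gloo")

    t = torch.zeros(1)
    dist.all_reduce(t, group=cpu_group)  # CPU all-reduce over the Gloo group
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```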
Expected behavior
Users should be able to create two comm groups with different backends.
Environment
I'm using the NGC container:
nvcr.io/nvidia/pytorch:21.09-py3
cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @SciPioneer @H-Huang