[AIR][Train] Add default network setting for pytorch multi-host training #26663
Comments
Yes, this is very confusing dev UX that both Richard and I got trapped by while doing CUJs. I can't imagine how our users are going to figure this out easily. @amogkam, do we have any current mechanism in Ray Train to set this automatically? How have users resolved this in the past?
In general, I believe NCCL should automatically be doing this. But agreed, we should do this programmatically for safety. This has only surfaced as a problem in Anyscale though, specifically with
@amogkam can you just add this as a Train FAQ?
cc @ilee300a FYI
Signed-off-by: Amog Kamsetty amogkamsetty@yahoo.com
Automatically set NCCL_SOCKET_IFNAME to prefer ethernet. Also adds a FAQ section in the docs on how to diagnose this issue.
Closes #26663
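A minimal sketch of the idea described in the PR text above, not the actual Ray Train patch: set NCCL_SOCKET_IFNAME to prefer ethernet-style interfaces before the NCCL process group is created, unless the user has already set it. The function name and the "ens,eth" preference value are illustrative assumptions.

import os
import torch.distributed as dist

def init_process_group_with_ethernet_preference(backend="nccl", **kwargs):
    # NCCL interprets NCCL_SOCKET_IFNAME as a comma-separated list of
    # interface-name prefixes, so "ens,eth" matches ens3, ens5, eth0, etc.
    # setdefault leaves any explicit user configuration untouched.
    os.environ.setdefault("NCCL_SOCKET_IFNAME", "ens,eth")
    dist.init_process_group(backend=backend, **kwargs)

With a wrapper like this, each training worker picks an ethernet interface by default instead of letting NCCL fall back to whatever interface it discovers first.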
What happened + What you expected to happen
Currently, users running PyTorch distributed training with AIR across multiple hosts will hit hanging NCCL communication if the network interface (NCCL_SOCKET_IFNAME, e.g. ens3 or ens5) is not set or is set incorrectly. This hurts dev UX.
The tutorial at https://github.com/inkawhich/pt-distributed-tutorial/blob/master/pytorch-aws-distributed-tutorial.py mentions using ifconfig to find the right network interface and setting it via the NCCL_SOCKET_IFNAME environment variable. A programmatic version of that step is sketched after the note below.
NOTE: Different hardware might require different values. g3 uses ens3 while g4dn uses ens5.
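A hedged helper, assuming a Linux host, that does the ifconfig step in Python: list the available interfaces and pick the first ethernet-style one (ens*/eth*) for NCCL_SOCKET_IFNAME. The helper name is illustrative, and since interface naming is hardware dependent (ens3 on g3, ens5 on g4dn per the note above), the chosen value should be verified.

import os
import socket

def pick_ethernet_interface():
    # socket.if_nameindex() returns (index, name) pairs for all
    # network interfaces on Linux.
    names = [name for _, name in socket.if_nameindex()]
    for name in names:
        if name.startswith(("ens", "eth")):
            return name
    raise RuntimeError(f"No ethernet-like interface found among {names}")

# Only set the variable if the user has not configured it already.
if "NCCL_SOCKET_IFNAME" not in os.environ:
    os.environ["NCCL_SOCKET_IFNAME"] = pick_ethernet_interface()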
Versions / Dependencies
master
Reproduction script
https://sourcegraph.com/github.com/ray-project/ray/-/blob/release/air_tests/air_benchmarks/workloads/pytorch_training_e2e.py
Running pytorch_training_e2e.py with workers spread across multiple hosts reproduces the issue.
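One possible workaround for the reproduction above, sketched here rather than taken from the benchmark script itself: propagate the interface setting to every Ray worker through runtime_env before launching training. The "ens5" value follows the g4dn note above and should be adjusted for other hardware.

import ray

# env_vars in runtime_env are forwarded to all Ray worker processes,
# so NCCL on every host sees the same interface preference.
ray.init(runtime_env={"env_vars": {"NCCL_SOCKET_IFNAME": "ens5"}})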
Issue Severity
Medium: It is a significant difficulty but I can work around it.