
[AIR][Train] Add default network setting for pytorch multi-host training #26663

Closed
jiaodong opened this issue Jul 18, 2022 · 5 comments · Fixed by #28633
@jiaodong (Member)

What happened + What you expected to happen

Currently, users who use AIR to run PyTorch distributed training across multiple hosts will hit a hanging issue if the network interface (e.g. ens) is not set, or is set incorrectly, via NCCL_SOCKET_IFNAME. This hurts dev UX.

https://github.com/inkawhich/pt-distributed-tutorial/blob/master/pytorch-aws-distributed-tutorial.py mentions using ifconfig to find the right interface and setting it with the env var NCCL_SOCKET_IFNAME:

ens5: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 9001
        inet 172.31.65.108  netmask 255.255.224.0  broadcast 172.31.95.255
        inet6 fe80::865:1bff:fe4c:2e03  prefixlen 64  scopeid 0x20<link>
        ether 0a:65:1b:4c:2e:03  txqueuelen 1000  (Ethernet)
        RX packets 12552852  bytes 18852483513 (18.8 GB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 4753080  bytes 6866629048 (6.8 GB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
Then set it when initializing Ray:

    import ray

    runtime_env = {"env_vars": {"NCCL_SOCKET_IFNAME": "ens3"}}
    ray.init(runtime_env=runtime_env)

NOTE: Different hardware might require different values. g3 uses ens3 while g4dn uses ens5.
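Since the right value varies by instance type, one could also discover a candidate interface at runtime instead of hardcoding it. A minimal sketch, assuming psutil is installed; the helper name pick_ethernet_ifname is hypothetical and not part of Ray:

    import psutil  # assumed available; not a Ray dependency

    def pick_ethernet_ifname() -> str:
        """Return the first ethernet-style interface name (hypothetical helper).

        Prefers AWS-style 'ens*' / 'eth*' names and skips loopback/docker
        interfaces, which NCCL cannot use for cross-host traffic.
        """
        names = psutil.net_if_addrs().keys()
        for prefix in ("ens", "eth"):
            for name in sorted(names):
                if name.startswith(prefix):
                    return name
        raise RuntimeError(f"No ethernet-like interface found among {sorted(names)}")

    # Usage sketch: pass the detected name to every Ray worker via runtime_env.
    # runtime_env = {"env_vars": {"NCCL_SOCKET_IFNAME": pick_ethernet_ifname()}}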

Versions / Dependencies

master

Reproduction script

https://sourcegraph.com/github.com/ray-project/ray/-/blob/release/air_tests/air_benchmarks/workloads/pytorch_training_e2e.py

Running pytorch_training_e2e.py with workers spread across multiple hosts.

Issue Severity

Medium: It is a significant difficulty but I can work around it.

jiaodong added the bug and air labels on Jul 18, 2022
@matthewdeng (Contributor)

It would be cool to find a way to exhaustively codify this 🤔

We could also add this to the FAQ as a stopgap, but I think the hard part for the user is identifying this issue in the first place...

cc @krfricke @amogkam

@jiaodong (Member, Author)

Yes, this is very confusing dev UX that both Richard and I got trapped by while doing CUJs. I can't imagine how our users are going to figure this out easily. @amogkam, do we have any current mechanism in Ray Train to set this automatically? How have users resolved this in the past?

@amogkam (Contributor)

amogkam commented Jul 18, 2022

In general, I believe NCCL should automatically be doing this, but agreed we should do it programmatically for safety. This has only surfaced as a problem on Anyscale, though, specifically with anyscale_default_cloud, and so far it has only affected release tests and not any users.
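For diagnosing which interface NCCL actually selected, one standard option (not mentioned in this thread) is to enable NCCL's own logging alongside the interface override. NCCL_DEBUG=INFO makes NCCL print log lines similar to "NCCL INFO NET/Socket : Using [0]ens5:172.31.65.108<0>", which confirms whether NCCL_SOCKET_IFNAME took effect on each host:

    import ray

    # Ask NCCL to log the interface it binds, and override the default pick.
    runtime_env = {
        "env_vars": {
            "NCCL_DEBUG": "INFO",
            "NCCL_SOCKET_IFNAME": "ens5",
        }
    }
    ray.init(runtime_env=runtime_env)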

@matthewdeng (Contributor)

@amogkam can you just add this as a Train FAQ?

@xwjiang2010 (Contributor)

cc @ilee300a FYI

amogkam added a commit that referenced this issue on Sep 21, 2022:

Automatically set NCCL_SOCKET_IFNAME to prefer ethernet. Also adds a FAQ section in the docs on how to diagnose this issue.

Closes #26663

Signed-off-by: Amog Kamsetty <amogkamsetty@yahoo.com>
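A minimal sketch of what such a default might look like; the constant name and exact prefix list here are assumptions for illustration, not necessarily what the merged PR uses. NCCL treats NCCL_SOCKET_IFNAME as a comma-separated list of interface name prefixes, so a value like "ens,eth" covers both g3 (ens3) and g4dn (ens5) instances:

    import os

    # Hypothetical default; NCCL matches these as name prefixes, trying
    # ethernet-style interfaces (ens*, eth*) before anything else.
    DEFAULT_NCCL_SOCKET_IFNAME = "ens,eth"

    def set_nccl_socket_ifname_default() -> None:
        # Respect an explicit user setting; only fill in the default
        # when the variable is unset.
        os.environ.setdefault("NCCL_SOCKET_IFNAME", DEFAULT_NCCL_SOCKET_IFNAME)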