
[Train] Change default NCCL_SOCKET_IFNAME to blacklist veth #31824

Merged

merged 8 commits into ray-project:master on Jan 24, 2023

Conversation

amogkam
Contributor

@amogkam amogkam commented Jan 21, 2023

Signed-off-by: amogkam amogkamsetty@yahoo.com

Closes #30333.

Previously, we set a default NCCL interface whitelist in Ray Train to prioritize ethernet, in order to avoid this issue: https://github.com/anyscale/product/issues/8310.

However, this default whitelist is not exhaustive, and it prevents users from doing distributed GPU training over wireless: #30333.

Instead, we change to a blacklist so that NCCL does not use veth interfaces, which resolves both issues (thanks @cadedaniel for identifying this!).
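
For context, a minimal sketch of what the change amounts to (the blacklist value follows the PR title; the helper function shown here is hypothetical and only illustrates how such a default is typically applied without clobbering a user's own setting):

import os

# Exclude loopback, docker bridges, and virtual ethernet pairs by prefix.
# NCCL interprets a leading "^" as "do not use interfaces with these prefixes".
DEFAULT_NCCL_SOCKET_IFNAME = "^lo,docker,veth"

def _apply_nccl_ifname_default():
    # Hypothetical helper: only apply the default if the user has not
    # already chosen their own interface list.
    os.environ.setdefault("NCCL_SOCKET_IFNAME", DEFAULT_NCCL_SOCKET_IFNAME)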

Why are these changes needed?

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: amogkam <amogkamsetty@yahoo.com>
@amogkam
Contributor Author

amogkam commented Jan 21, 2023

Passing distributed GPU training test on anyscale_default_cloud: https://buildkite.com/ray-project/release-tests-pr/builds/26067

@amogkam amogkam marked this pull request as ready for review January 21, 2023 01:56
Signed-off-by: amogkam <amogkamsetty@yahoo.com>
@cadedaniel
Member

nice!

From the logs, it looks like NCCL is still seeing veth on some of the nodes. Any idea why?

I see 12 worker logs with veth, but all 16 have ens3

$ ag --case-sensitive --literal 'Using [0]ens' | wc -l
      16

$ ag --case-sensitive 'veth' | wc -l
      12
:job_id:03000000
:actor_name:RayTrainWorker
ip-172-31-201-206:7411:7871 [0] NCCL INFO cudaDriverVersion 11060
ip-172-31-201-206:7411:7871 [0] NCCL INFO Bootstrap : Using ens3:172.31.201.206<0>
ip-172-31-201-206:7411:7871 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
ip-172-31-201-206:7411:7885 [0] NCCL INFO Failed to open libibverbs.so[.1]
ip-172-31-201-206:7411:7885 [0] NCCL INFO NET/Socket : Using [0]ens3:172.31.201.206<0> [1]veth01fa248:fe80::343a:e3ff:fe09:aeff%veth01fa248<0>
ip-172-31-201-206:7411:7885 [0] NCCL INFO Using network Socket
ip-172-31-201-206:7411:7885 [0] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1
ip-172-31-201-206:7411:7885 [0] NCCL INFO Channel 00 : 2[1d0] -> 3[1e0] via SHM/direct/direct
ip-172-31-201-206:7411:7885 [0] NCCL INFO Channel 01 : 2[1d0] -> 3[1e0] via SHM/direct/direct
ip-172-31-201-206:7411:7885 [0] NCCL INFO Connected all rings
ip-172-31-201-206:7411:7885 [0] NCCL INFO Channel 00 : 2[1d0] -> 1[1c0] via SHM/direct/direct
ip-172-31-201-206:7411:7885 [0] NCCL INFO Channel 01 : 2[1d0] -> 1[1c0] via SHM/direct/direct
ip-172-31-201-206:7411:7885 [0] NCCL INFO Connected all trees
ip-172-31-201-206:7411:7885 [0] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
ip-172-31-201-206:7411:7885 [0] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
ip-172-31-201-206:7411:7885 [0] NCCL INFO comm 0x7f4cd4008ec0 rank 2 nranks 16 cudaDev 0 busId 1d0 - Init COMPLETE
ip-172-31-201-206:7411:7887 [0] NCCL INFO [Service thread] Connection closed by localRank 2
ip-172-31-201-206:7411:7411 [0] NCCL INFO comm 0x7f4cd4008ec0 rank 2 nranks 16 cudaDev 0 busId 1d0 - Abort COMPLETE
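
For reference, a quick way to list the interfaces a node actually exposes, to spot the veth devices (a minimal stdlib sketch; socket.if_nameindex() is Unix-only):

import socket

# Print (index, name) for every network interface on this node;
# docker-created veth pairs show up here alongside ens3/eth0.
for idx, name in socket.if_nameindex():
    print(idx, name)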

@cadedaniel cadedaniel left a comment (Member)

discussed offline: we're not sure whether that line is printed before or after the blacklist is applied. we should catch it in the nightly tests if the ordering is nondeterministic.

can we file an issue against product to remove the virtual ethernet interface? then Ray Train can get out of the way of users completely here (and just default to NCCL's behavior).

# "en".
DEFAULT_NCCL_SOCKET_IFNAME = "en,eth,bond"
# Blacklist virtualized networking.
DEFAULT_NCCL_SOCKET_IFNAME = "^lo,docker,vethc"
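
For readers unfamiliar with the variable: NCCL matches each comma-separated entry as an interface-name prefix, and a leading ^ inverts the whole list into an exclusion (this is documented NCCL behavior). A minimal sketch of the equivalent manual setting, using the corrected prefix discussed below:

import os

# Skip anything whose name starts with lo, docker, or veth;
# NCCL otherwise picks an interface on its own.
os.environ["NCCL_SOCKET_IFNAME"] = "^lo,docker,veth"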
@cadedaniel cadedaniel left a comment (Member)

oh actually the reason is that it should be veth, not vethc. "c" happened to be the first hex character of the ID I sent you.

that also explains why it was present in only 12 of the 16 -- the other 4 must have had "c" as the first character of the ID.
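
A tiny illustration of the prefix mismatch (interface names here are made-up examples; docker generates veth names with a random hex suffix):

# NCCL excludes by name prefix, so "vethc" only filters veth devices
# whose random suffix happens to start with "c".
blacklist_prefix = "vethc"
interfaces = ["veth01fa248", "vethc3ab912"]  # hypothetical names
for ifname in interfaces:
    hit = ifname.startswith(blacklist_prefix)
    print(ifname, "excluded" if hit else "still visible to NCCL")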

@amogkam amogkam left a comment (Contributor Author)

Ah, good catch... now I'm wondering why this passed. Maybe this is not needed at all 🤔. Let me run the test with this default removed.

Signed-off-by: amogkam <amogkamsetty@yahoo.com>
@cadedaniel cadedaniel left a comment (Member)

Can we create an issue against product to remove the virtual network interface?

@krfricke krfricke left a comment (Contributor)

Quick question: the title says we blacklist vethc - instead we whitelist veth. Is that what we want? Edit: Sorry, I misread the line.

@amogkam amogkam changed the title [Train] Change default NCCL_SOCKET_IFNAME to blacklist vethc [Train] Change default NCCL_SOCKET_IFNAME to blacklist veth Jan 23, 2023
@amogkam
Contributor Author

amogkam commented Jan 23, 2023

Updated the title-- it should be veth

@krfricke krfricke left a comment (Contributor)

Thanks!

@amogkam amogkam merged commit 9bebf57 into ray-project:master Jan 24, 2023
@amogkam amogkam deleted the default-nccl branch January 24, 2023 01:04
cadedaniel pushed a commit to cadedaniel/ray that referenced this pull request Mar 22, 2023
…project#31824)

Signed-off-by: amogkam <amogkamsetty@yahoo.com>
cassidylaidlaw pushed a commit to cassidylaidlaw/ray that referenced this pull request Mar 28, 2023
…project#31824)

Signed-off-by: amogkam <amogkamsetty@yahoo.com>
jjyao pushed a commit that referenced this pull request Jan 8, 2024
When running distributed PyTorch without GPUs, PyTorch selects a localhost interface for Gloo (i.e., 127.0.0.1:XXX), breaking distributed training. PyTorch's interface-selection method can yield the incorrect interface when a) the hostname resolves locally to the loopback address, or b) hostname lookups fail.

This is scoped to DBR specifically because eth0 is guaranteed to exist there. PyTorch+Gloo does not support deny-listing like NCCL does (as we use in #31824), because PyTorch directly uses the environment variable GLOO_SOCKET_IFNAME as the interface to use: https://github.com/pytorch/pytorch/blob/7956ca16e649d86cbf11b6e122090fa05678fac3/torch/csrc/distributed/c10d/init.cpp#L2243
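
A minimal sketch of the workaround this commit describes (GLOO_SOCKET_IFNAME is the real PyTorch variable; the eth0 value reflects the DBR assumption above, and the guard is illustrative rather than the exact implementation):

import os

# Gloo takes an explicit interface name rather than a deny-list like NCCL,
# so pin it directly -- but only if the user hasn't already set one.
os.environ.setdefault("GLOO_SOCKET_IFNAME", "eth0")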

Signed-off-by: Ian Rodney <ian.rodney@gmail.com>
vickytsang pushed a commit to ROCm/ray that referenced this pull request Jan 12, 2024
…2202)

Signed-off-by: Ian Rodney <ian.rodney@gmail.com>
Successfully merging this pull request may close these issues.

[Train] Ray Train Constants on Multinode causes NCCL Error