
Conversation

@xunnanxu
Contributor

Summary:
For the listening port, the mesh connection setup currently leads to O(mn) port usage, where m is the number of ranks per host and n is the total number of ranks.

Even when `SO_REUSEADDR` is set, it only allows ports held by sockets in the `ESTABLISHED` or `TIME_WAIT` state to be reused. Hence in large training jobs, or even in test environments where many processes are packed onto the same machine, we would quickly run out of ephemeral ports (e.g. a local 200-process job would need 40k ephemeral ports just for listening, which is obviously very inefficient and most likely exceeds the range of ephemeral ports allowed on Linux systems, which by default spans 32768–60999, roughly 28k ports).
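
For context, a minimal sketch with plain POSIX sockets (illustrative only, not gloo's actual code) of what setting `SO_REUSEADDR` on a listening socket looks like; the key point is that each such socket still occupies a distinct listening port:

```cpp
// Illustrative only: each listener still occupies one port;
// SO_REUSEADDR merely allows rebinding a port stuck in TIME_WAIT.
#include <arpa/inet.h>
#include <cstdint>
#include <netinet/in.h>
#include <sys/socket.h>

int makeListeningSocket(uint16_t port) {
  int fd = socket(AF_INET, SOCK_STREAM, 0);
  int opt = 1;
  setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &opt, sizeof(opt));

  sockaddr_in addr{};
  addr.sin_family = AF_INET;
  addr.sin_addr.s_addr = htonl(INADDR_ANY);
  addr.sin_port = htons(port);
  bind(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr));
  listen(fd, SOMAXCONN);
  return fd;  // error handling omitted for brevity
}
```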

We fix this by using a single listening socket per device instance instead of one per pair. Connections for all pair instances are multiplexed over that single listening socket by adding a sequence number to the address struct. For ranks packed onto the same host with the same interface address, the sequence number also differentiates between them, so each has a unique `Address` object associated with it.
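
A hedged sketch of that idea: the address exchanged during rendezvous carries, alongside the shared listening socket's host/port, a sequence number identifying the pair. Field names here are illustrative, not gloo's exact layout.

```cpp
#include <cstdint>
#include <sys/socket.h>

// Illustrative layout: many pairs (and co-located ranks) share the same
// sockaddr, and the seq field tells them apart.
struct Address {
  sockaddr_storage ss;  // host/port of the single per-device listening socket
  socklen_t len;        // length of the address stored in ss
  uint64_t seq;         // per-pair sequence number multiplexed on that socket
};
```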

During the actual connection, each pair has one side act as the `Initiator` and the other as the `Listener`.
We assign the roles purely by an arbitrary address comparison. The exact result doesn't matter since TCP is bidirectional, as long as it is consistent within a pair.
The initiator connects to the listener's address and writes a few bytes containing the sequence number. The listener waits for a connection on the shared listening socket, where it reads that same sequence number. Once the listener side establishes the connection, that `Pair` gets promoted via the deferred callback to handle the actual connection post rendezvous.
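
An illustrative sketch of this handshake, reusing the `Address` layout from the sketch above; names like `pickRole`, `initiate`, and `onAccepted` are made up for the example and are not gloo's actual API:

```cpp
#include <cstdint>
#include <cstring>
#include <sys/socket.h>
#include <unistd.h>

enum class Role { Initiator, Listener };

// Any deterministic total order over addresses works; both sides only have to
// agree. Here we compare the raw sockaddr bytes and break ties on seq.
Role pickRole(const Address& self, const Address& peer) {
  int c = memcmp(&self.ss, &peer.ss, sizeof(self.ss));
  if (c != 0) {
    return c < 0 ? Role::Initiator : Role::Listener;
  }
  return self.seq < peer.seq ? Role::Initiator : Role::Listener;
}

// Initiator: connect to the peer's shared listening socket and announce which
// pair this connection belongs to by writing the sequence number first.
void initiate(int fd, const Address& peer) {
  connect(fd, reinterpret_cast<const sockaddr*>(&peer.ss), peer.len);
  uint64_t seq = peer.seq;
  write(fd, &seq, sizeof(seq));  // error handling omitted for brevity
  // ... the pair continues its normal post-rendezvous protocol on fd ...
}

// Listener: runs for every accept() on the shared socket; reads the sequence
// number and hands the connection to the matching pair's deferred callback.
void onAccepted(int fd) {
  uint64_t seq = 0;
  read(fd, &seq, sizeof(seq));  // error handling omitted for brevity
  // pairs[seq]->handleConnected(fd);  // hypothetical promotion hook
}
```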

Credit to the original author: Pieter Noordhuis (@pietern)

This diff cleans up a few things and resolves conflicts.

Differential Revision: D45437709

fbshipit-source-id: 193fecb7d58e1d3a3acce82614f62d56865c2251
@facebook-github-bot

This pull request was exported from Phabricator. Differential Revision: D45437709

@facebook-github-bot

This pull request has been merged in 31b1f02.

xunnanxu added a commit to xunnanxu/pytorch that referenced this pull request Jun 6, 2023
Summary:
Pull Request resolved: pytorch#101438

The previous tests are all too small in scale and could not expose gloo's earlier limitation. This verifies that reusing the listening port is needed.

Related PR: pytorch/gloo#361

Test Plan:
```
buck test mode/opt caffe2/test/distributed:c10d_spawn
```

Differential Revision: D45639555

fbshipit-source-id: 0f4822e9067ff6f16693e2affe969e6c96d2f926