
Conversation

@xunnanxu
Contributor

Summary:
For the listening port, the mesh connection setup currently leads to O(mn) port usage, where m is the number of ranks per host and n is the total number of ranks.

Even when `SO_REUSEADDR` is set, it only allows ports held by sockets in the `ESTABLISHED` or `TIME_WAIT` state to be reused. Hence in large training jobs, or even in test environments where many processes are packed onto the same machine, we would quickly run out of ephemeral ports (e.g. a local 200-process job would need 40k ephemeral ports just for listening, which is obviously very inefficient and most likely exceeds the range of ephemeral ports allowed on Linux systems, which by default spans 32768–60999, roughly 28k ports).
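
For context, a minimal sketch with plain POSIX sockets (illustrative only, not gloo's actual code) of what setting `SO_REUSEADDR` on a listening socket looks like; the key point is that each such socket still occupies a distinct listening port:

```cpp
// Illustrative only: each listener still occupies one port;
// SO_REUSEADDR merely allows rebinding a port stuck in TIME_WAIT.
#include <arpa/inet.h>
#include <cstdint>
#include <netinet/in.h>
#include <sys/socket.h>

int makeListeningSocket(uint16_t port) {
  int fd = socket(AF_INET, SOCK_STREAM, 0);
  int opt = 1;
  setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &opt, sizeof(opt));

  sockaddr_in addr{};
  addr.sin_family = AF_INET;
  addr.sin_addr.s_addr = htonl(INADDR_ANY);
  addr.sin_port = htons(port);
  bind(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr));
  listen(fd, SOMAXCONN);
  return fd;  // error handling omitted for brevity
}
```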

We fix this by using a single listening socket per device instance instead of one per pair. Connections for all pair instances are multiplexed over that single listening socket by adding a sequence number to the address struct. For ranks packed onto the same host with the same interface address, the sequence number also differentiates between them, so each has a unique `Address` object associated with it.
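
A hedged sketch of that idea: the address exchanged during rendezvous carries, alongside the shared listening socket's host/port, a sequence number identifying the pair. Field names here are illustrative, not gloo's exact layout.

```cpp
#include <cstdint>
#include <sys/socket.h>

// Illustrative layout: many pairs (and co-located ranks) share the same
// sockaddr, and the seq field tells them apart.
struct Address {
  sockaddr_storage ss;  // host/port of the single per-device listening socket
  socklen_t len;        // length of the address stored in ss
  uint64_t seq;         // per-pair sequence number multiplexed on that socket
};
```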

During the actual connection, each pair has one side act as the `Initiator` and the other as the `Listener`.
We assign the roles purely by an arbitrary address comparison. The exact result doesn't matter since TCP is bidirectional, as long as it is consistent within a pair.
The initiator connects to the listener's address and writes a few bytes containing the sequence number. The listener waits for a connection on the shared listening socket, where it reads that same sequence number. Once the listener side establishes the connection, that `Pair` gets promoted via the deferred callback to handle the actual connection post rendezvous.
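
An illustrative sketch of this handshake, reusing the `Address` layout from the sketch above; names like `pickRole`, `initiate`, and `onAccepted` are made up for the example and are not gloo's actual API:

```cpp
#include <cstdint>
#include <cstring>
#include <sys/socket.h>
#include <unistd.h>

enum class Role { Initiator, Listener };

// Any deterministic total order over addresses works; both sides only have to
// agree. Here we compare the raw sockaddr bytes and break ties on seq.
Role pickRole(const Address& self, const Address& peer) {
  int c = memcmp(&self.ss, &peer.ss, sizeof(self.ss));
  if (c != 0) {
    return c < 0 ? Role::Initiator : Role::Listener;
  }
  return self.seq < peer.seq ? Role::Initiator : Role::Listener;
}

// Initiator: connect to the peer's shared listening socket and announce which
// pair this connection belongs to by writing the sequence number first.
void initiate(int fd, const Address& peer) {
  connect(fd, reinterpret_cast<const sockaddr*>(&peer.ss), peer.len);
  uint64_t seq = peer.seq;
  write(fd, &seq, sizeof(seq));  // error handling omitted for brevity
  // ... the pair continues its normal post-rendezvous protocol on fd ...
}

// Listener: runs for every accept() on the shared socket; reads the sequence
// number and hands the connection to the matching pair's deferred callback.
void onAccepted(int fd) {
  uint64_t seq = 0;
  read(fd, &seq, sizeof(seq));  // error handling omitted for brevity
  // pairs[seq]->handleConnected(fd);  // hypothetical promotion hook
}
```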

Credit to the original author: Pieter Noordhuis (@pietern)

This diff cleans up a few things and resolves conflicts.

Differential Revision: D45437709

fbshipit-source-id: 193fecb7d58e1d3a3acce82614f62d56865c2251
@facebook-github-bot

This pull request was exported from Phabricator. Differential Revision: D45437709

@facebook-github-bot

This pull request has been merged in 31b1f02.

xunnanxu added a commit to xunnanxu/pytorch that referenced this pull request Jun 6, 2023
Summary:
Pull Request resolved: pytorch#101438

The previous tests are all too small in scale and could not expose gloo's earlier limitation. This verifies that reusing the listening port is needed.

Related PR: pytorch/gloo#361

Test Plan:
```
buck test mode/opt caffe2/test/distributed:c10d_spawn
```

Differential Revision: D45639555

fbshipit-source-id: 0f4822e9067ff6f16693e2affe969e6c96d2f926