Use a single listening socket per device #361
Closed
Summary:
For listening ports, the mesh connection currently leads to O(mn) port usage per host, where m is the number of ranks per host and n is the total number of ranks.
Even when `SO_REUSEADDR` is set, only ports held by sockets in the `ESTABLISHED` or `TIME-WAIT` state can be reused; ports held by listening sockets cannot. Hence in large training jobs, or even in test environments where many processes are packed onto the same machine, we soon run out of ephemeral ports. For example, a local job with 200 processes (m = n = 200) would need about 40k ephemeral ports just for listening, which is very inefficient and most likely exceeds the ephemeral port range on Linux systems (the default range of 32768 to 60999 gives about 28K ports).
We fix this by using a single listening socket per device instance instead of one per pair. Connections to all pair instances are multiplexed on that single listening socket by adding a sequence number to the address struct. For ranks packed on the same host with the same interface address, the sequence number differentiates between them, so that each has a unique associated `Address` object.
During the actual connection, each pair has one side acting as `Initiator` and the other as `Listener`. Roles are assigned purely by an arbitrary address comparison. The exact outcome does not matter because TCP is bidirectional, as long as the two sides of a pair agree on it.
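To illustrate the role assignment, here is a minimal Python sketch (not the code in this diff). The `Address` tuple, the `Role` enum, and `assign_role` are hypothetical names, and plain tuple ordering stands in for whatever comparison the implementation actually uses. The only property that matters is that the two sides of a pair always end up with opposite roles.

```python
from enum import Enum
from typing import NamedTuple


class Role(Enum):
    INITIATOR = "initiator"
    LISTENER = "listener"


class Address(NamedTuple):
    # Hypothetical stand-in for the address struct: interface address,
    # port of the shared listening socket, and the sequence number.
    host: str
    port: int
    seq: int


def assign_role(local: Address, remote: Address) -> Role:
    """Pick a role from an arbitrary but deterministic comparison.

    Assumption: the "smaller" address initiates. Any total order works,
    provided both sides apply the same one.
    """
    if local == remote:
        raise ValueError("a pair needs two distinct addresses")
    return Role.INITIATOR if local < remote else Role.LISTENER
```

Because both sides evaluate the same comparison with the arguments swapped, exactly one of them becomes the initiator without any extra coordination.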
The initiator connects to the listener's address and writes a few bytes containing the sequence number. The listener waits for a connection on the shared listening socket, from which it reads that same sequence number. Once the listener side establishes the connection, that `Pair` is promoted via the deferred callback to handle the actual connection after rendezvous.
Credit to original author: Pieter Noordhuis (pietern)
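The multiplexing handshake can be sketched in Python as follows. This is an illustration under stated assumptions, not the implementation in this diff: the `Device` class, `wait_for_connection`, `accept_one`, `initiate`, and the 8-byte big-endian wire format for the sequence number are all hypothetical. The sketch handles both orders, since a connection may arrive before or after the pair registers its callback.

```python
import socket
import struct

SEQ_FORMAT = "!Q"  # assumption: 8-byte big-endian sequence number
SEQ_SIZE = struct.calcsize(SEQ_FORMAT)


def recv_exact(conn, size):
    """Read exactly `size` bytes or raise if the peer closes early."""
    buf = b""
    while len(buf) < size:
        chunk = conn.recv(size - len(buf))
        if not chunk:
            raise ConnectionError("peer closed before sending enough bytes")
        buf += chunk
    return buf


class Device:
    """Owns the single listening socket shared by all pairs."""

    def __init__(self, host="127.0.0.1"):
        # Port 0 asks the OS for one ephemeral port for the whole device.
        self.listener = socket.create_server((host, 0))
        self.address = self.listener.getsockname()
        self.waiting = {}  # seq -> callback registered by a pair
        self.arrived = {}  # seq -> connection that showed up first

    def wait_for_connection(self, seq, callback):
        """Listener side of a pair: run `callback` once `seq` connects."""
        conn = self.arrived.pop(seq, None)
        if conn is not None:
            callback(conn)
        else:
            self.waiting[seq] = callback

    def accept_one(self):
        """Accept one connection and route it by its sequence number."""
        conn, _ = self.listener.accept()
        (seq,) = struct.unpack(SEQ_FORMAT, recv_exact(conn, SEQ_SIZE))
        callback = self.waiting.pop(seq, None)
        if callback is not None:
            callback(conn)
        else:
            self.arrived[seq] = conn


def initiate(address, seq):
    """Initiator side: connect, then identify the pair by `seq`."""
    conn = socket.create_connection(address)
    conn.sendall(struct.pack(SEQ_FORMAT, seq))
    return conn
```

However many pairs a device hosts, it consumes one listening port; the sequence number alone decides which pair receives each accepted connection.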
This diff cleans up a few things and resolves conflicts.
Differential Revision: D45437709