[c10d] Increase socket buffer size to allow ProcessGroup init up to 12k ranks

The c10d socket and the gloo listener both set their buffer size (the `listen` backlog) to 2048, which causes connection issues at 4k scale. This diff sets the backlog to `-1`, which makes `somaxconn` the actual buffer size, aiming to enable 24k PG init without a crash. The experiment shows that 12k ranks can be created successfully without a crash.
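To make the backlog behavior concrete, here is a minimal standalone sketch, not taken from c10d or gloo. It assumes a POSIX system, and it assumes Linux for the procfs path. The helper names `listen_with_backlog` and `read_somaxconn` are hypothetical and exist only for illustration.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

#include <fstream>

// Hypothetical helper (not part of c10d): opens a TCP socket on the loopback
// interface with a kernel-assigned port and calls listen() with the given
// backlog. Returns the listening file descriptor, or -1 on failure.
int listen_with_backlog(int backlog) {
  int fd = ::socket(AF_INET, SOCK_STREAM, 0);
  if (fd < 0) {
    return -1;
  }

  sockaddr_in addr{};
  addr.sin_family = AF_INET;
  addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
  addr.sin_port = 0; // let the kernel pick a free port

  if (::bind(fd, reinterpret_cast<const sockaddr*>(&addr), sizeof(addr)) != 0 ||
      ::listen(fd, backlog) != 0) {
    ::close(fd);
    return -1;
  }
  return fd;
}

// Hypothetical helper: reads the system-wide cap on the accept queue length.
// Assumes the Linux procfs layout; returns -1 where the entry is unavailable.
long read_somaxconn() {
  std::ifstream in("/proc/sys/net/core/somaxconn");
  long value = 0;
  if (in >> value) {
    return value;
  }
  return -1;
}
```

Calling `listen_with_backlog(-1)` mirrors the `::listen(..., -1)` call in this change. On Linux the requested backlog is capped at `net.core.somaxconn`, so the value returned by `read_somaxconn()` is the queue length to expect. Raising that sysctl, rather than the constant passed to `listen`, is what enlarges the queue further.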

This splits the original diff into OSS and internal parts.

Caution: we need the change on both gloo and c10d to enable 12k PG init. Updating only one side may not offer the benefit.

Differential Revision: [D48634654](https://our.internmc.facebook.com/intern/diff/D48634654/)

[ghstack-poisoned]
XilunWu committed Aug 24, 2023
1 parent 2515ab9 commit b5ec08c
Showing 1 changed file with 2 additions and 2 deletions.
4 changes: 2 additions & 2 deletions torch/csrc/distributed/c10d/socket.cpp
@@ -562,7 +562,7 @@ bool SocketListenOp::tryListen(const ::addrinfo& addr) {
   }
 
   // NOLINTNEXTLINE(bugprone-argument-comment)
-  if (::listen(socket_->handle(), /*backlog=*/2048) != 0) {
+  if (::listen(socket_->handle(), -1 /* backlog */) != 0) {
     recordError(
         "The server socket has failed to listen on {} {}.",
         addr,
@@ -614,7 +614,7 @@ std::unique_ptr<SocketImpl> SocketListenFromFdOp::run() const {
         expected_port_)};
   }
 
-  if (::listen(socket->handle(), 2048 /* backlog */) != 0) {
+  if (::listen(socket->handle(), -1 /* backlog */) != 0) {
     throw SocketError{fmt::format(
         "Failed to listen on socket initialized from fd {}: {}.",
         socket->handle(),
