Skip to content

Conversation

pietern
Copy link
Contributor

@pietern pietern commented Oct 25, 2018

Summary: The "right" strategy of creating a socket, binding to an undefined port, closing the socket, and reusing the port it was bound to, was subject to a race condition. Another process could bind to that same port sooner than the tests would, causing an "Address already in use" failure when rank 0 would try and bind to that same port. The THD tests have been using a fixed port since forever. Time will tell if this fixes #12876.

Differential Revision: D10850614

Summary: The "right" strategy of creating a socket, binding to an undefined port, closing the socket, and reusing the port it was bound to, was subject to a race condition. Another process could bind to that same port sooner than the tests would, causing an "Address already in use" failure when rank 0 would try and bind to that same port. The THD tests have been using a fixed port since forever. Time will tell if this fixes pytorch#12876.

Differential Revision: D10850614

fbshipit-source-id: 9af1ffb44150063c1f0cbd2bf3462a13b4d55fc1
@ezyang
Copy link
Contributor

ezyang commented Oct 25, 2018

This probably will work better, if we always manage to reliably shut down the servers after every test, but it's not the 100% right approach. Might be good enough.

I think the actually "right" way to do this is to spawn the server and have it automatically assign itself a port, then have it communicate the port to the parent process somehow, so that you can send to it. Sometimes this is inconvenient to do, in which case another robust way is to keep spawning the server configured with different ports until it succeeds.

@pietern
Copy link
Contributor Author

pietern commented Oct 25, 2018

@ezyang Agreed. I took a stab at the right approach (exactly what you mention) but it would have involved breaking up how torch.distributed.init_process_group works today. Basically we would need to create the store daemon before init_process_group and pass it, or pass a bound file descriptor or something. After 50 LOC I aborted and thought let's try this first. If it doesn't work we should revisit.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

TCPStore: Address already in use in test_distributed

2 participants