Use fixed MASTER_PORT in test_distributed #13109

pietern · 2018-10-25T05:32:26Z

Summary: The "right" strategy of creating a socket, binding it to an unspecified port, closing the socket, and reusing the port it was bound to was subject to a race condition. Another process could bind to that same port before the tests did, causing an "Address already in use" failure when rank 0 tried to bind to it. The THD tests have been using a fixed port since forever. Time will tell if this fixes #12876.
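For illustration, the race described above can be reproduced with plain sockets. This is a minimal sketch, not the code from test_distributed: `find_free_port` and `demonstrate_race` are hypothetical names, and the "intruder" socket stands in for whatever unrelated process grabs the port first.

```python
import socket


def find_free_port():
    """Race-prone helper: let the OS pick a port, then release it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))  # port 0 asks the OS for any free port
        return s.getsockname()[1]
    # The socket is closed on return, so the port is no longer reserved.


def demonstrate_race():
    """Return True if the late bind fails because the port was taken."""
    port = find_free_port()

    # Stand-in for another process that binds the port during the window
    # between find_free_port() returning and rank 0 binding.
    intruder = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    intruder.bind(("127.0.0.1", port))
    intruder.listen(1)

    # Stand-in for rank 0 trying to bind the port it believes is free.
    rank0 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        rank0.bind(("127.0.0.1", port))
        return False
    except OSError:
        # Typically reported as "Address already in use".
        return True
    finally:
        rank0.close()
        intruder.close()


if __name__ == "__main__":
    print("bind failed:", demonstrate_race())
```

In the real tests the window is much wider than in this sketch, because worker processes are spawned between picking the port and binding it.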

Differential Revision: D10850614

fbshipit-source-id: 9af1ffb44150063c1f0cbd2bf3462a13b4d55fc1

ezyang · 2018-10-25T15:00:14Z

This will probably work better, as long as we reliably shut down the servers after every test, but it's not the 100% right approach. Might be good enough.

I think the actually "right" way to do this is to spawn the server and have it automatically assign itself a port, then have it communicate the port to the parent process somehow, so that you can send to it. Sometimes this is inconvenient to do, in which case another robust way is to keep spawning the server configured with different ports until it succeeds.
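A rough sketch of both ideas with plain sockets, assuming nothing about the torch.distributed internals: in `spawn_server` the child binds to port 0 and reports the chosen port on stdout, and `bind_with_retry` is an in-process analogue of retrying with different ports. All names here (`SERVER_CODE`, `spawn_server`, `read_greeting`, `bind_with_retry`) are made up for the example.

```python
import socket
import subprocess
import sys

# Child program: bind to an OS-assigned port, report it, serve one client.
SERVER_CODE = r"""
import socket
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("127.0.0.1", 0))
srv.listen(1)
print(srv.getsockname()[1], flush=True)  # tell the parent which port we got
conn, _ = srv.accept()
conn.sendall(b"ready")
conn.close()
srv.close()
"""


def spawn_server():
    """Start the server and learn its port from the first line of stdout."""
    proc = subprocess.Popen(
        [sys.executable, "-c", SERVER_CODE],
        stdout=subprocess.PIPE,
        text=True,
    )
    port = int(proc.stdout.readline())
    return proc, port


def read_greeting(port):
    """Connect to the server and read until it closes the connection."""
    data = b""
    with socket.create_connection(("127.0.0.1", port), timeout=10) as client:
        while True:
            chunk = client.recv(64)
            if not chunk:
                return data
            data += chunk


def bind_with_retry(candidate_ports):
    """Try each candidate port in turn; return (listening socket, port)."""
    for port in candidate_ports:
        srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            srv.bind(("127.0.0.1", port))
        except OSError:
            srv.close()  # port taken, move on to the next candidate
            continue
        srv.listen(1)
        return srv, srv.getsockname()[1]
    raise RuntimeError("no candidate port was free")


if __name__ == "__main__":
    proc, port = spawn_server()
    print(port, read_greeting(port))
    proc.wait(timeout=10)
```

Because the server never releases the port between choosing it and listening on it, there is no window for another process to take it.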

pietern · 2018-10-25T15:22:14Z

@ezyang Agreed. I took a stab at the right approach (exactly what you mention) but it would have involved breaking up how torch.distributed.init_process_group works today. Basically we would need to create the store daemon before init_process_group and pass it, or pass a bound file descriptor or something. After 50 LOC I aborted and thought let's try this first. If it doesn't work we should revisit.
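The "pass a bound file descriptor" variant might look roughly like the following with plain sockets. This is only a POSIX sketch under stated assumptions: it does not touch init_process_group or the store daemon, and `CHILD_CODE`, `spawn_with_bound_socket`, and `read_greeting` are hypothetical names.

```python
import socket
import subprocess
import sys

# Child program: adopt an already-listening socket passed in by fd number.
CHILD_CODE = r"""
import socket, sys
fd = int(sys.argv[1])
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM, fileno=fd)
conn, _ = srv.accept()
conn.sendall(b"ready")
conn.close()
srv.close()
"""


def spawn_with_bound_socket():
    """Bind and listen in the parent, then hand the socket to the child."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # the port is held from here on
    listener.listen(1)
    port = listener.getsockname()[1]
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD_CODE, str(listener.fileno())],
        pass_fds=[listener.fileno()],  # keep this fd open in the child
    )
    listener.close()  # the child holds its own copy of the descriptor
    return proc, port


def read_greeting(port):
    """Connect to the server and read until it closes the connection."""
    data = b""
    with socket.create_connection(("127.0.0.1", port), timeout=10) as client:
        while True:
            chunk = client.recv(64)
            if not chunk:
                return data
            data += chunk


if __name__ == "__main__":
    proc, port = spawn_with_bound_socket()
    print(port, read_greeting(port))
    proc.wait(timeout=10)
```

The parent knows the port before the child even starts, and the port is never unbound in between, which is what makes this variant race-free.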

ezyang approved these changes Oct 25, 2018

facebook-github-bot closed this in 2a6431b Oct 25, 2018

ezyang added the merged label Jun 25, 2019
