This repository has been archived by the owner on Jan 6, 2023. It is now read-only.
Use EtcdStore rather than TCPStore when using etcd_rdzv #34
Summary:
Background:
The rdzv interface returns a store as part of the `next()` API. We used to return a TCPStore because, prior to torch 1.4, it was not possible to provide a Python implementation of the `c10d::Store` interface: no trampoline class existed in the pybind definition.

TCPStore had a chicken-and-egg problem in the unit-test context: the constructor on the "master" (rank 0, where the TCP store is hosted) blocks until all workers have joined, while the workers need the IP and port of the master. However, finding an unused port and then passing it to the TCPStore constructor and to the workers has a race condition (which is exacerbated during stress tests). Hence we had to run the tests in `serial` mode. This is no longer necessary for `EtcdStore`.

There are two pending GitHub issues that need this change (see attached tasks).
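
For reviewers unfamiliar with the port race mentioned above, here is a minimal sketch using only the Python standard library. It is not code from this change and does not use TCPStore; the helper names (`find_unused_port`, `try_bind`, `simulate_port_race`) are hypothetical and only illustrate the "probe for a free port, then bind it later" pattern, with a second socket standing in for a concurrently running test.

```python
import socket
from typing import Optional


def find_unused_port() -> int:
    """Ask the OS for a free port, then release it (hypothetical test helper)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        probe.bind(("127.0.0.1", 0))  # port 0 lets the OS pick any free port
        return probe.getsockname()[1]
    # The port is released here, so nothing reserves it for the caller.


def try_bind(port: int) -> Optional[socket.socket]:
    """Try to bind the given port; return the socket, or None if it is taken."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind(("127.0.0.1", port))
        return sock
    except OSError:
        sock.close()
        return None


def simulate_port_race() -> bool:
    """Return True if the intended owner of the port loses it to a rival."""
    port = find_unused_port()
    # Window of vulnerability: a concurrently running test grabs the port
    # after it was probed but before the store could bind it.
    rival = try_bind(port)
    store = try_bind(port)  # stands in for the store's server socket
    lost = store is None
    for sock in (rival, store):
        if sock is not None:
            sock.close()
    return lost


if __name__ == "__main__":
    print("store lost the port:", simulate_port_race())
```

When tests run in parallel, the "rival" is simply another test that probed the same port, which is why serial execution hid the problem rather than fixing it.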
Differential Revision: D19511842