This repository has been archived by the owner on Jan 6, 2023. It is now read-only.
Use EtcdStore rather than TCPStore when using etcd_rdzv (#34)
Summary: Pull Request resolved: #34

Background: The rdzv interface returns a store as part of the `next()` API. We used to return a TCPStore because, prior to torch 1.4, it was not possible to provide a Python implementation of the `c10d::Store` interface: no trampoline class existed in the pybind definition.

TCPStore had a chicken-and-egg problem in the unit test context: the constructor on the "master" (rank 0, where the TCP store is hosted) blocks until all workers have joined, while the workers need the IP and port of the master. However, finding an unused port and passing it to the TCPStore's constructor and to the workers has a race condition (which is exacerbated during stress tests). Hence we had to run the tests in `serial` mode. This is no longer necessary with `EtcdStore`.

There are two pending GitHub issues that need this change (see attached tasks).

Reviewed By: isunjin

Differential Revision: D19511842

fbshipit-source-id: 13fff370668d110fedc436684226e41ad3e69df9
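The port race described above can be illustrated with a small standard-library sketch. This is not code from this commit or from the test suite. The helper names `find_free_port` and `port_is_taken` are hypothetical, and the "squatter" socket merely stands in for an unrelated process (for example, a concurrently running test) that grabs the port in the window between discovering it and binding the store to it.

```python
import socket


def find_free_port() -> int:
    """Hypothetical helper: ask the OS for an unused port, then release it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))  # port 0 lets the OS pick any free port
        return s.getsockname()[1]
    # The socket is closed here, so the port is free again for anyone to take.


def port_is_taken(port: int) -> bool:
    """Hypothetical helper: report whether binding to the port fails."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind(("127.0.0.1", port))
        except OSError:
            return True
        return False


# Step 1: the test discovers a "free" port to hand to the master and workers.
port = find_free_port()

# Step 2: before the master binds it, something else claims the port.
squatter = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
squatter.bind(("127.0.0.1", port))
squatter.listen(1)

# Step 3: the master's attempt to host a store on that port would now fail.
taken = port_is_taken(port)
squatter.close()
```

Under parallel test execution the squatter can be a sibling test doing exactly the same discovery, which is presumably why running the tests serially hid the problem.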