This repository has been archived by the owner on Jan 6, 2023. It is now read-only.

Use EtcdStore rather than TCPStore when using etcd_rdzv #34

Closed
wants to merge 2 commits into from

Commits on Jan 22, 2020

  1. Move get_rank() to torchelastic.distributed.utils

    Differential Revision: D19507117
    
    fbshipit-source-id: a02c267cc9ffacec7bcb9bac4608e04a255f6540
    Kiuk Chung authored and facebook-github-bot committed Jan 22, 2020
    5f2937c
  2. Use EtcdStore rather than TCPStore when using etcd_rdzv

    Summary:
    Background:
    
    The rdzv interface returns a store as part of the `next()` API. We used to return a TCPStore because, prior to torch 1.4, it was not possible to provide a Python implementation of the `c10d::Store` interface: the pybind definition had no trampoline class.
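For illustration only, here is a minimal sketch of the key-value contract a Python store has to provide (`set`, a blocking `get`, and an atomic `add`). `InMemoryStore` is a hypothetical, dict-backed stand-in: it does not subclass the torch store class and does not talk to etcd, so it shows the shape of the interface and is not the `EtcdStore` implementation.

```python
import threading
from typing import Dict, Optional


class InMemoryStore:
    """Hypothetical dict-backed stand-in for a c10d-style key-value store."""

    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}
        self._cond = threading.Condition()

    def set(self, key: str, value: bytes) -> None:
        """Store a value and wake up any readers waiting on a key."""
        with self._cond:
            self._data[key] = value
            self._cond.notify_all()

    def get(self, key: str, timeout: Optional[float] = None) -> bytes:
        """Block until the key exists (or the timeout expires), then return it."""
        with self._cond:
            if not self._cond.wait_for(lambda: key in self._data, timeout):
                raise TimeoutError(f"key {key!r} was not set in time")
            return self._data[key]

    def add(self, key: str, amount: int) -> int:
        """Atomically increment a counter and return its new value."""
        with self._cond:
            current = int(self._data.get(key, b"0")) + amount
            self._data[key] = str(current).encode()
            self._cond.notify_all()
            return current
```

Note that constructing this object never waits for peers and needs no host or port, which is the property the rest of this summary relies on.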
    
    TCPStore had a chicken-and-egg problem in the unit-test context: the constructor on the "master" (rank 0, where the TCP store is hosted) blocks until all workers have joined, while the workers need the IP and port of the master. However, finding an unused port and passing it to the TCPStore constructor and to the workers is subject to a race condition (which is exacerbated during stress tests). Hence we had to run the tests in `serial` mode. This is no longer necessary with `EtcdStore`.
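The port race described above can be reproduced with plain sockets. This is a minimal sketch on the loopback interface; `find_free_port` and `demonstrate_port_race` are hypothetical helper names, not torchelastic or TCPStore code, and the "squatter" socket stands in for another test process that grabs the port first.

```python
import socket


def find_free_port() -> int:
    """Ask the OS for an unused port, then release it immediately.

    The port is only guaranteed to be free while the socket is open;
    once it is closed, any other process may claim it.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]


def demonstrate_port_race() -> bool:
    """Return True if the 'server' loses the port to a competing socket."""
    port = find_free_port()

    # Stand-in for another test process that binds the port first.
    squatter = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Stand-in for the store server that was told to use `port`.
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        squatter.bind(("127.0.0.1", port))
        try:
            server.bind(("127.0.0.1", port))
        except OSError:
            # Address already in use: the "free" port was taken in between.
            return True
        return False
    finally:
        server.close()
        squatter.close()
```

The window between `find_free_port()` returning and the server binding is small for a single test but widens when many tests run in parallel, which is consistent with the failures showing up mainly under stress.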
    
    There are two pending GitHub issues that need this change (see attached tasks).
    
    Differential Revision: D19511842
    
    fbshipit-source-id: bccec3c9663a6dbb690c1d7f610f9f546736128d
    Kiuk Chung authored and facebook-github-bot committed Jan 22, 2020
    d25d88a