Skip to content
This repository has been archived by the owner on Jan 6, 2023. It is now read-only.

Commit

Permalink
Use EtcdStore rather than TCPStore when using etcd_rdzv (#34)
Browse files Browse the repository at this point in the history
Summary:
Pull Request resolved: #34

Background:

The rdzv interface returns a store as part of the `next()` API. We used to return a TCPStore since prior to torch 1.4 it was not possible to provide a python implementation of the `c10d::Store` interface because no trampoline class existed in the pybind definition.

TCPStore had a chicken&egg problem in the unittest context since the constructor on the "master" (rank0 == where the tcp store was hosted) block waits until all workers have joined and the workers need the ip and port of the master. However, finding an unused port and passing it to the TCPStore's constructor and workers has a race condition (which is exacerbated during stress tests). Hence we had to run the tests in `serial`mode. This is no longer necessary for `EtcdStore`.

There are two pending github issues that need this change (see attached tasks)

Reviewed By: isunjin

Differential Revision: D19511842

fbshipit-source-id: 13fff370668d110fedc436684226e41ad3e69df9
  • Loading branch information
Kiuk Chung authored and facebook-github-bot committed Jan 22, 2020
1 parent 34b43b7 commit 393a26c
Showing 1 changed file with 2 additions and 51 deletions.
53 changes: 2 additions & 51 deletions torchelastic/rendezvous/etcd_rendezvous.py
Original file line number Diff line number Diff line change
Expand Up @@ -94,23 +94,8 @@ def __del__(self):
def next_rendezvous(self):
rdzv_version, rank, world_size = self._rdzv_impl.rendezvous_barrier()

# TODO: https://github.com/pytorch/elastic/issues/11
# make EtcdStore the default and remove TCPStore code
# Setup a c10d store for this specific rendezvous version,
# by piggybacking on the etcd handler used during rendezvous.
# Switch back to EtcdStore once issue with
# pybind11-trampoline for c10d Store is resolved.
# store = self._rdzv_impl.setup_kv_store(rdzv_version)
# path once the pybind11-trampoline fix for c10d::Store is included in
# the next pytorch release. Then, remove this hack.
import torchelastic.rendezvous # noqa

if "_TORCHELASTIC_USE_ETCDSTORE" in torchelastic.rendezvous.__dict__:
log.info("Using EtcdStore for c10d::Store implementation")
store = self._rdzv_impl.setup_kv_store(rdzv_version)
else:
log.info("Using TCPStore for c10d::Store implementation")
store = setup_tcpstore(rank, world_size, rdzv_version, self._rdzv_impl)
log.info("Creating EtcdStore as the c10d::Store implementation")
store = self._rdzv_impl.setup_kv_store(rdzv_version)

return store, rank, world_size

Expand Down Expand Up @@ -1065,40 +1050,6 @@ def _get_socket_with_port():
raise RuntimeError("Failed to create a socket")


# Helper function to setup a TCPStore-based c10d::Store implementation.
def setup_tcpstore(rank, world_size, rdzv_version, rdzv_impl):
if rank == 0:
import socket
from contextlib import closing

# FIXME: ideally, TCPStore should have an API that
# accepts a pre-constructed socket.
with closing(_get_socket_with_port()) as sock:
host = socket.gethostname()
port = sock.getsockname()[1]

rdzv_impl.store_extra_data(
rdzv_version, key="tcpstore_server", value="{}:{}".format(host, port)
)

log.info(f"Setting up TCPStore server on {host}:{port}")
start_daemon = True
sock.close() # FIXME: get rid of race-condition by improving TCPStore API
store = TCPStore(host, port, world_size, start_daemon)
log.info(f"TCPStore server initialized on {host}:{port}")
else:
hostport = rdzv_impl.load_extra_data(rdzv_version, key="tcpstore_server")
log.info(f"Rank {rank} will conenct to TCPStore server at {hostport}")

import re

host, port = re.match(r"(.+):(\d+)$", hostport).groups()
start_daemon = False
store = TCPStore(host, int(port), world_size, start_daemon)

return store


# Helper for _etcd_rendezvous_handler(url)
def _parse_etcd_client_params(params):
kwargs = {}
Expand Down

0 comments on commit 393a26c

Please sign in to comment.