-
Notifications
You must be signed in to change notification settings - Fork 22.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[1/n][torch/elastic][upstream] Move torchelastic/rendezvous to torch/distributed/rendezvous (#141) #53172
Conversation
💊 CI failures summary and remediationsAs of commit 1aac1f5 (more details on the Dr. CI page):
❄️ 1 failure tentatively classified as flakybut reruns have not yet been triggered to confirm: pytorch_xla_linux_bionic_py3_6_clang9_test (1/1)Step: "Run tests" (full log | diagnosis details | 🔁 rerun) ❄️
|
This pull request was exported from Phabricator. Differential Revision: D26718746 |
1 similar comment
This pull request was exported from Phabricator. Differential Revision: D26718746 |
6449cf3
to
baf12ff
Compare
This pull request was exported from Phabricator. Differential Revision: D26718746 |
baf12ff
to
ee0cbd1
Compare
…distributed/rendezvous (pytorch#53172) Summary: Pull Request resolved: pytorch#53172 Pull Request resolved: pytorch/elastic#141 Upstreams two modules to torch: 1. `torchelastic.rendezvous` 2. `torchelastic.utils` These modules were chosen as `[1/n]` since they are the leaf modules in torchelastic. ==== NOTES: ==== 1. I'm disabling etcd_rendezvous and etcd_server tests in CIRCLECI for the moment since I need to edit the test dockers to contain the etcd server binary (there's 4-5 test dockers - one for each platform so this is going to take some time for me to set up the environments and test) - T85992919. 2. I've fixed all lint errors on python files but there are ones on the cpp files on the ZeusRendezvous. I took a look at them, and I don't want to fix the linter errors right now for 2 major reasons: 1. Some of them are more than formatting changes (e.g. std::move vs pass by value) and I don't want to introduce bundled changes with the move 1. The old rendezvous code (the one we forked from in caffe2/fb) has the same problems and I think its better for us to deal with this when we deprecate caffe2/fb/rendezvous in favor of the one in torchelastic -T86012579. Test Plan: ``` buck test mode/dev-nosan //caffe2/torch/distributed/elastic/utils/test/... buck test mode/dev-nosan //caffe2/torch/distributed/elastic/utils/data/test/... buck test mode/dev-nosan //caffe2/torch/distributed/elastic/rendezvous/test/... buck test mode/dev-nosan //caffe2/torch/distributed/elastic/rendezvous/fb/... buck test mode/dev-nosan //pytorch/elastic/torchelastic/... ``` \+ Sandcastle Differential Revision: D26718746 fbshipit-source-id: 7f832d9e1496bf3b15945cc0a1d050dda96aac8d
This pull request was exported from Phabricator. Differential Revision: D26718746 |
ee0cbd1
to
1aac1f5
Compare
from torch.distributed.elastic.rendezvous.etcd_server import EtcdServer | ||
|
||
|
||
@unittest.skipIf(os.getenv("CIRCLECI"), "T85992919 temporarily disabling in circle ci") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this leaking internal task id to OSS?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we can create an Github issue to track this?
is_server=True, server_addr=socket.gethostname(), world_size=2 | ||
) | ||
|
||
def test_create_store_multi(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit and can be done in followup PR: The rest of distributed tests uses a MultiProcessTestCase
as base class. Does it make sense if we use that base test class here?
class MultiProcessTestCase(TestCase): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for pointing this out @mrshenli ! With the awesome changes that @H-Huang and @cbalioglu have added to c10d::Store
, the utils/distributed.py
file can go away. I'm planning to do this when I upstream the torchelastic.agent
module since that's the one that is using these util methods.
For general long term fixes that come up, I'll create github issues to track them. Feel free to point them out during the upstream!
from base64 import b64decode, b64encode | ||
from typing import Optional | ||
|
||
import etcd # type: ignore[import] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @malfet @seemethere is this dependency OK?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've talked to @seemethere, since this is a runtime dependency - for now we won't add the dependency to torch, but we will only add it to the test Docker image (so that we can reenable these tests). Users will have to pip install etcd_client
if they want to use EtcdRendezvous
but we will be adding a StandaloneRendezvous
with no etcd runtime dependency.
Looks like we are doing something similar to numpy
(users will have to manually install that dependency if they want to use an implementation in torch that depends on numpy) so we have precedence.
log = logging.getLogger(__name__) | ||
|
||
|
||
def find_free_port(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(can be done in follow up PR), there is another find_free_port
implementation in common_utils.py
, do we need to consolidate?
pytorch/torch/testing/_internal/common_utils.py
Line 1469 in ac668c5
def find_free_port(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ahh yes I did consider it. Thing is that the one in common_utils
is under the testing/_internal
directory and etcd_server
is technically a part of the lib. I'll create a github issue for extracting this function into a proper util-lib module and having both use that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Directory structure LGTM! Only stamping for the folder structure. I didn't review rendezvous implementations.
Looks like the tests are not yet linked to CI and docs are not yet linked to docgen. I guess those will be done in followup PRs?
server.stop() | ||
|
||
Args: | ||
etcd_binary_path: path of etcd server binary (see above for fallback path) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
other API docs usually also specific the arg type, i.e., (str)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
got it, docs will be reviewed as part of phase III (in mid April) so we'll make sure there is the same "look-and-feel" with the rest of torch.
from torch.utils.data.distributed import DistributedSampler | ||
|
||
|
||
class ElasticDistributedSampler(DistributedSampler): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we plan to consolidate all distributed samplers into the same folder (torch.utils or torch.distributed)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yea we should, it'll be done after the upstream is done and when we start "blending" the torchelastic modules to torch. The same needs to be done to multiprocessing
, logging
, events
, utils
.
This pull request has been merged in 402a793. |
…distributed/rendezvous (#53172) Summary: Pull Request resolved: #53172 Pull Request resolved: pytorch/elastic#141 Upstreams two modules to torch: 1. `torchelastic.rendezvous` 2. `torchelastic.utils` These modules were chosen as `[1/n]` since they are the leaf modules in torchelastic. ==== NOTES: ==== 1. I'm disabling etcd_rendezvous and etcd_server tests in CIRCLECI for the moment since I need to edit the test dockers to contain the etcd server binary (there's 4-5 test dockers - one for each platform so this is going to take some time for me to set up the environments and test) - T85992919. 2. I've fixed all lint errors on python files but there are ones on the cpp files on the ZeusRendezvous. I took a look at them, and I don't want to fix the linter errors right now for 2 major reasons: 1. Some of them are more than formatting changes (e.g. std::move vs pass by value) and I don't want to introduce bundled changes with the move 1. The old rendezvous code (the one we forked from in caffe2/fb) has the same problems and I think its better for us to deal with this when we deprecate caffe2/fb/rendezvous in favor of the one in torchelastic -T86012579. Test Plan: ``` buck test mode/dev-nosan //caffe2/torch/distributed/elastic/utils/test/... buck test mode/dev-nosan //caffe2/torch/distributed/elastic/utils/data/test/... buck test mode/dev-nosan //caffe2/torch/distributed/elastic/rendezvous/test/... buck test mode/dev-nosan //caffe2/torch/distributed/elastic/rendezvous/fb/... buck test mode/dev-nosan //pytorch/elastic/torchelastic/... ``` \+ Sandcastle Reviewed By: H-Huang Differential Revision: D26718746 fbshipit-source-id: 67cc0350c3d847221cb3c3038f98f47915362f51
…distributed/rendezvous (pytorch#53172) Summary: Pull Request resolved: pytorch#53172 Pull Request resolved: pytorch/elastic#141 Upstreams two modules to torch: 1. `torchelastic.rendezvous` 2. `torchelastic.utils` These modules were chosen as `[1/n]` since they are the leaf modules in torchelastic. ==== NOTES: ==== 1. I'm disabling etcd_rendezvous and etcd_server tests in CIRCLECI for the moment since I need to edit the test dockers to contain the etcd server binary (there's 4-5 test dockers - one for each platform so this is going to take some time for me to set up the environments and test) - T85992919. 2. I've fixed all lint errors on python files but there are ones on the cpp files on the ZeusRendezvous. I took a look at them, and I don't want to fix the linter errors right now for 2 major reasons: 1. Some of them are more than formatting changes (e.g. std::move vs pass by value) and I don't want to introduce bundled changes with the move 1. The old rendezvous code (the one we forked from in caffe2/fb) has the same problems and I think its better for us to deal with this when we deprecate caffe2/fb/rendezvous in favor of the one in torchelastic -T86012579. Test Plan: ``` buck test mode/dev-nosan //caffe2/torch/distributed/elastic/utils/test/... buck test mode/dev-nosan //caffe2/torch/distributed/elastic/utils/data/test/... buck test mode/dev-nosan //caffe2/torch/distributed/elastic/rendezvous/test/... buck test mode/dev-nosan //caffe2/torch/distributed/elastic/rendezvous/fb/... buck test mode/dev-nosan //pytorch/elastic/torchelastic/... ``` \+ Sandcastle Reviewed By: H-Huang Differential Revision: D26718746 fbshipit-source-id: 67cc0350c3d847221cb3c3038f98f47915362f51
Summary: Pull Request resolved: pytorch/pytorch#53172 Pull Request resolved: pytorch#141 Upstreams two modules to torch: 1. `torchelastic.rendezvous` 2. `torchelastic.utils` These modules were chosen as `[1/n]` since they are the leaf modules in torchelastic. ==== NOTES: ==== 1. I'm disabling etcd_rendezvous and etcd_server tests in CIRCLECI for the moment since I need to edit the test dockers to contain the etcd server binary (there's 4-5 test dockers - one for each platform so this is going to take some time for me to set up the environments and test) - T85992919. 2. I've fixed all lint errors on python files but there are ones on the cpp files on the ZeusRendezvous. I took a look at them, and I don't want to fix the linter errors right now for 2 major reasons: 1. Some of them are more than formatting changes (e.g. std::move vs pass by value) and I don't want to introduce bundled changes with the move 1. The old rendezvous code (the one we forked from in caffe2/fb) has the same problems and I think its better for us to deal with this when we deprecate caffe2/fb/rendezvous in favor of the one in torchelastic -T86012579. Reviewed By: H-Huang Differential Revision: D26718746 fbshipit-source-id: 67cc0350c3d847221cb3c3038f98f47915362f51
Summary:
Pull Request resolved: pytorch/elastic#141
Upstreams two modules to torch:
torchelastic.rendezvous
torchelastic.utils
These modules were chosen as
[1/n]
since they are the leaf modules in torchelastic.NOTE: This diff is NOT ready to land - I still have a few dependencies to test/codemod. torch only has a single buck target:
caffe2:torch
which makes this a bit tricky.Test Plan:
+ Sandcastle
Differential Revision: D26718746