Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[1/n][torch/elastic][upstream] Move torchelastic/rendezvous to torch/distributed/rendezvous (#141) #53172

Closed
wants to merge 1 commit into from

Conversation

kiukchung
Copy link
Collaborator

Summary:
Pull Request resolved: pytorch/elastic#141

Upstreams two modules to torch:

  1. torchelastic.rendezvous
  2. torchelastic.utils

These modules were chosen as [1/n] since they are the leaf modules in torchelastic.

NOTE: This diff is NOT ready to land - I still have a few dependencies to test/codemod. torch only has a single buck target: caffe2:torch which makes this a bit tricky.

Test Plan:

buck test mode/dev-nosan //caffe2/torch/distributed/elastic/utils/test/...
buck test mode/dev-nosan //caffe2/torch/distributed/elastic/utils/data/test/...
buck test mode/dev-nosan //caffe2/torch/distributed/elastic/rendezvous/test/...
buck test mode/dev-nosan //caffe2/torch/distributed/elastic/rendezvous/fb/...
buck test mode/dev-nosan //pytorch/elastic/torchelastic/...

+ Sandcastle

Differential Revision: D26718746

@facebook-github-bot
Copy link
Contributor

facebook-github-bot commented Mar 3, 2021

💊 CI failures summary and remediations

As of commit 1aac1f5 (more details on the Dr. CI page):



❄️ 1 failure tentatively classified as flaky

but reruns have not yet been triggered to confirm:

See CircleCI build pytorch_xla_linux_bionic_py3_6_clang9_test (1/1)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun) ❄️

Mar 05 01:23:02 urllib.error.HTTPError: HTTP Error 403: Forbidden
Mar 05 01:23:02   File "/opt/conda/lib/python3.6/urllib/request.py", line 532, in open
Mar 05 01:23:02     response = meth(req, response)
Mar 05 01:23:02   File "/opt/conda/lib/python3.6/urllib/request.py", line 642, in http_response
Mar 05 01:23:02     'http', request, response, code, msg, hdrs)
Mar 05 01:23:02   File "/opt/conda/lib/python3.6/urllib/request.py", line 570, in error
Mar 05 01:23:02     return self._call_chain(*args)
Mar 05 01:23:02   File "/opt/conda/lib/python3.6/urllib/request.py", line 504, in _call_chain
Mar 05 01:23:02     result = func(*args)
Mar 05 01:23:02   File "/opt/conda/lib/python3.6/urllib/request.py", line 650, in http_error_default
Mar 05 01:23:02     raise HTTPError(req.full_url, code, msg, hdrs, fp)
Mar 05 01:23:02 urllib.error.HTTPError: HTTP Error 403: Forbidden
Mar 05 01:23:02 
Mar 05 01:23:02 ----------------------------------------------------------------------
Mar 05 01:23:02 Ran 1 test in 0.414s
Mar 05 01:23:02 
Mar 05 01:23:02 FAILED (errors=1)
Mar 05 01:23:03 
0it [00:00, ?it/s]
Mar 05 01:23:03 + cleanup
Mar 05 01:23:03 + retcode=1
Mar 05 01:23:03 + set +x
Mar 05 01:23:03 =================== sccache compilation log ===================

ci.pytorch.org: 2 failed


This comment was automatically generated by Dr. CI (expand for details).Follow this link to opt-out of these comments for your Pull Requests.

Please report bugs/suggestions to the (internal) Dr. CI Users group.

@facebook-github-bot facebook-github-bot added cla signed oncall: distributed Add this issue/PR to distributed oncall triage queue fb-exported labels Mar 3, 2021
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D26718746

1 similar comment
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D26718746

@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D26718746

…distributed/rendezvous (pytorch#53172)

Summary:
Pull Request resolved: pytorch#53172

Pull Request resolved: pytorch/elastic#141

Upstreams two modules to torch:

1. `torchelastic.rendezvous`
2. `torchelastic.utils`

These modules were chosen as `[1/n]` since they are the leaf modules in torchelastic.

==== NOTES: ====
1. I'm disabling etcd_rendezvous and etcd_server tests in CIRCLECI for the moment since I need to edit the test dockers to contain the etcd server binary (there's 4-5 test dockers - one for each platform so this is going to take some time for me to set up the environments and test) - T85992919.

2. I've fixed all lint errors on python files but there are ones on the cpp files on the ZeusRendezvous. I took a look at them, and I don't want to fix the linter errors right now for 2 major reasons:
     1. Some of them are more than formatting changes (e.g. std::move vs pass by value) and I don't want to introduce bundled changes with the move
     1. The old rendezvous code (the one we forked from in caffe2/fb) has the same problems and I think its better for us to deal with this when we deprecate caffe2/fb/rendezvous in favor of the one in torchelastic -T86012579.

Test Plan:
```
buck test mode/dev-nosan //caffe2/torch/distributed/elastic/utils/test/...
buck test mode/dev-nosan //caffe2/torch/distributed/elastic/utils/data/test/...
buck test mode/dev-nosan //caffe2/torch/distributed/elastic/rendezvous/test/...
buck test mode/dev-nosan //caffe2/torch/distributed/elastic/rendezvous/fb/...
buck test mode/dev-nosan //pytorch/elastic/torchelastic/...
```
\+ Sandcastle

Differential Revision: D26718746

fbshipit-source-id: 7f832d9e1496bf3b15945cc0a1d050dda96aac8d
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D26718746

from torch.distributed.elastic.rendezvous.etcd_server import EtcdServer


@unittest.skipIf(os.getenv("CIRCLECI"), "T85992919 temporarily disabling in circle ci")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this leaking internal task id to OSS?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can create an Github issue to track this?

is_server=True, server_addr=socket.gethostname(), world_size=2
)

def test_create_store_multi(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit and can be done in followup PR: The rest of distributed tests uses a MultiProcessTestCase as base class. Does it make sense if we use that base test class here?

class MultiProcessTestCase(TestCase):

Copy link
Collaborator Author

@kiukchung kiukchung Mar 5, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pointing this out @mrshenli ! With the awesome changes that @H-Huang and @cbalioglu have added to c10d::Store, the utils/distributed.py file can go away. I'm planning to do this when I upstream the torchelastic.agent module since that's the one that is using these util methods.

For general long term fixes that come up, I'll create github issues to track them. Feel free to point them out during the upstream!

from base64 import b64decode, b64encode
from typing import Optional

import etcd # type: ignore[import]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @malfet @seemethere is this dependency OK?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've talked to @seemethere, since this is a runtime dependency - for now we won't add the dependency to torch, but we will only add it to the test Docker image (so that we can reenable these tests). Users will have to pip install etcd_client if they want to use EtcdRendezvous but we will be adding a StandaloneRendezvous with no etcd runtime dependency.

Looks like we are doing something similar to numpy (users will have to manually install that dependency if they want to use an implementation in torch that depends on numpy) so we have precedence.

log = logging.getLogger(__name__)


def find_free_port():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(can be done in follow up PR), there is another find_free_port implementation in common_utils.py, do we need to consolidate?

def find_free_port():

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ahh yes I did consider it. Thing is that the one in common_utils is under the testing/_internal directory and etcd_server is technically a part of the lib. I'll create a github issue for extracting this function into a proper util-lib module and having both use that.

Copy link
Contributor

@mrshenli mrshenli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Directory structure LGTM! Only stamping for the folder structure. I didn't review rendezvous implementations.

Looks like the tests are not yet linked to CI and docs are not yet linked to docgen. I guess those will be done in followup PRs?

server.stop()

Args:
etcd_binary_path: path of etcd server binary (see above for fallback path)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

other API docs usually also specific the arg type, i.e., (str)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

got it, docs will be reviewed as part of phase III (in mid April) so we'll make sure there is the same "look-and-feel" with the rest of torch.

from torch.utils.data.distributed import DistributedSampler


class ElasticDistributedSampler(DistributedSampler):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we plan to consolidate all distributed samplers into the same folder (torch.utils or torch.distributed)?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea we should, it'll be done after the upstream is done and when we start "blending" the torchelastic modules to torch. The same needs to be done to multiprocessing, logging, events, utils.

@facebook-github-bot
Copy link
Contributor

This pull request has been merged in 402a793.

facebook-github-bot pushed a commit that referenced this pull request Mar 5, 2021
…distributed/rendezvous (#53172)

Summary:
Pull Request resolved: #53172

Pull Request resolved: pytorch/elastic#141

Upstreams two modules to torch:

1. `torchelastic.rendezvous`
2. `torchelastic.utils`

These modules were chosen as `[1/n]` since they are the leaf modules in torchelastic.

==== NOTES: ====
1. I'm disabling etcd_rendezvous and etcd_server tests in CIRCLECI for the moment since I need to edit the test dockers to contain the etcd server binary (there's 4-5 test dockers - one for each platform so this is going to take some time for me to set up the environments and test) - T85992919.

2. I've fixed all lint errors on python files but there are ones on the cpp files on the ZeusRendezvous. I took a look at them, and I don't want to fix the linter errors right now for 2 major reasons:
     1. Some of them are more than formatting changes (e.g. std::move vs pass by value) and I don't want to introduce bundled changes with the move
     1. The old rendezvous code (the one we forked from in caffe2/fb) has the same problems and I think its better for us to deal with this when we deprecate caffe2/fb/rendezvous in favor of the one in torchelastic -T86012579.

Test Plan:
```
buck test mode/dev-nosan //caffe2/torch/distributed/elastic/utils/test/...
buck test mode/dev-nosan //caffe2/torch/distributed/elastic/utils/data/test/...
buck test mode/dev-nosan //caffe2/torch/distributed/elastic/rendezvous/test/...
buck test mode/dev-nosan //caffe2/torch/distributed/elastic/rendezvous/fb/...
buck test mode/dev-nosan //pytorch/elastic/torchelastic/...
```
\+ Sandcastle

Reviewed By: H-Huang

Differential Revision: D26718746

fbshipit-source-id: 67cc0350c3d847221cb3c3038f98f47915362f51
xsacha pushed a commit to xsacha/pytorch that referenced this pull request Mar 31, 2021
…distributed/rendezvous (pytorch#53172)

Summary:
Pull Request resolved: pytorch#53172

Pull Request resolved: pytorch/elastic#141

Upstreams two modules to torch:

1. `torchelastic.rendezvous`
2. `torchelastic.utils`

These modules were chosen as `[1/n]` since they are the leaf modules in torchelastic.

==== NOTES: ====
1. I'm disabling etcd_rendezvous and etcd_server tests in CIRCLECI for the moment since I need to edit the test dockers to contain the etcd server binary (there's 4-5 test dockers - one for each platform so this is going to take some time for me to set up the environments and test) - T85992919.

2. I've fixed all lint errors on python files but there are ones on the cpp files on the ZeusRendezvous. I took a look at them, and I don't want to fix the linter errors right now for 2 major reasons:
     1. Some of them are more than formatting changes (e.g. std::move vs pass by value) and I don't want to introduce bundled changes with the move
     1. The old rendezvous code (the one we forked from in caffe2/fb) has the same problems and I think its better for us to deal with this when we deprecate caffe2/fb/rendezvous in favor of the one in torchelastic -T86012579.

Test Plan:
```
buck test mode/dev-nosan //caffe2/torch/distributed/elastic/utils/test/...
buck test mode/dev-nosan //caffe2/torch/distributed/elastic/utils/data/test/...
buck test mode/dev-nosan //caffe2/torch/distributed/elastic/rendezvous/test/...
buck test mode/dev-nosan //caffe2/torch/distributed/elastic/rendezvous/fb/...
buck test mode/dev-nosan //pytorch/elastic/torchelastic/...
```
\+ Sandcastle

Reviewed By: H-Huang

Differential Revision: D26718746

fbshipit-source-id: 67cc0350c3d847221cb3c3038f98f47915362f51
fotstrt pushed a commit to eth-easl/elastic that referenced this pull request Feb 17, 2022
Summary:
Pull Request resolved: pytorch/pytorch#53172

Pull Request resolved: pytorch#141

Upstreams two modules to torch:

1. `torchelastic.rendezvous`
2. `torchelastic.utils`

These modules were chosen as `[1/n]` since they are the leaf modules in torchelastic.

==== NOTES: ====
1. I'm disabling etcd_rendezvous and etcd_server tests in CIRCLECI for the moment since I need to edit the test dockers to contain the etcd server binary (there's 4-5 test dockers - one for each platform so this is going to take some time for me to set up the environments and test) - T85992919.

2. I've fixed all lint errors on python files but there are ones on the cpp files on the ZeusRendezvous. I took a look at them, and I don't want to fix the linter errors right now for 2 major reasons:
     1. Some of them are more than formatting changes (e.g. std::move vs pass by value) and I don't want to introduce bundled changes with the move
     1. The old rendezvous code (the one we forked from in caffe2/fb) has the same problems and I think its better for us to deal with this when we deprecate caffe2/fb/rendezvous in favor of the one in torchelastic -T86012579.

Reviewed By: H-Huang

Differential Revision: D26718746

fbshipit-source-id: 67cc0350c3d847221cb3c3038f98f47915362f51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla signed fb-exported Merged oncall: distributed Add this issue/PR to distributed oncall triage queue
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants