Implement JIT serialization of ProcessGroup #48544

gmagogsfm · 2020-11-28T22:07:32Z

This diff enables JIT serialization of ProcessGroup, including both the base ProcessGroup class and derived classes such as ProcessGroupNCCL.

If a ProcessGroup is created via a high-level API like dist_c10d.frontend().new_process_group_helper(), it is automatically serializable. If a ProcessGroup is created via its derived class's TorchBind API, such as dist_c10d.ProcessGroupNCCL(), it has to be given a name and registered with dist_c10d.frontend().register_process_group_name to be uniquely identifiable and serializable.

Fixed a minor bug in the new dist_c10d frontend, which failed to check whether a process group is used or not.
Fixed an issue where test_jit_c10d.py wasn't actually run due to a configuration bug. The tests now run as slow tests (which requires a ci-all/* branch).
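
To illustrate the name-based scheme described above, here is a minimal pure-Python sketch. It is not the dist_c10d API: `ProcessGroupRegistry`, `FakeProcessGroup`, `serialize`, and `deserialize` are hypothetical names, and the assumption that only the registered name is written out (with the group looked up again on load) is inferred from the description, not confirmed by it.

```python
class ProcessGroupRegistry:
    """Toy stand-in for the frontend's name bookkeeping (hypothetical)."""

    def __init__(self):
        self._name_to_pg = {}

    def register(self, pg, name):
        # Associate a unique, user-chosen name with a process group object.
        self._name_to_pg[name] = pg

    def name_of(self, pg):
        # Reverse lookup by identity: which name was this object registered under?
        for name, candidate in self._name_to_pg.items():
            if candidate is pg:
                return name
        raise KeyError("process group was never registered under a name")

    def by_name(self, name):
        return self._name_to_pg[name]


class FakeProcessGroup:
    """Placeholder; a real ProcessGroup holds communicator state."""


def serialize(registry, pg):
    # Assumption: only the name is stored, not the communicator state.
    return {"process_group_name": registry.name_of(pg)}


def deserialize(registry, state):
    # The loading side must already have a group registered under this name.
    return registry.by_name(state["process_group_name"])


registry = ProcessGroupRegistry()
pg = FakeProcessGroup()
registry.register(pg, "my_nccl_group")
state = serialize(registry, pg)
restored = deserialize(registry, state)
```

In this model a group that was never registered cannot be serialized at all, which mirrors why directly constructed groups need an explicit registration step.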

dr-ci · 2020-11-29T02:25:56Z

💊 CI failures summary and remediations

As of commit f6ae7f0 (more details on the Dr. CI page):

1/5 failures possibly* introduced in this PR
- 1/1 non-CircleCI failure(s)
4/5 broken upstream at merge base 4eb4db7 since Dec 04

3 jobs timed out:

pytorch_windows_vs2019_py36_cpu_test2
pytorch_windows_vs2019_py36_cuda10.1_test2
pytorch_windows_vs2019_py36_cuda11.1_test2

🚧 1 ongoing upstream failure:

These were probably caused by upstream breakages that are not fixed yet:

pytorch_linux_xenial_cuda10_2_cudnn7_py3_multigpu_test since Dec 04
ci.pytorch.org: 1 failed

Failed: pr/pytorch-linux-bionic-rocm3.9-py3.6

This comment was automatically generated by Dr. CI (expand for details).

This comment has been revised 255 times.

wanchaol · 2020-12-03T19:44:28Z

torch/lib/c10d/frontend.cpp

          prefix_store, rank, world_size, options);
-#endif
+#else
+    TORCH_CHECK(false, "Attempting to create GLOO-based process group while GLOO is either not enabled or built")


nit: AT_ERROR instead?

wanchaol · 2020-12-03T19:44:44Z

torch/lib/c10d/frontend.cpp

    } else {
      // TODO: discuss to figure out how to extend this to third party backends?
-      return pg;
+      TORCH_CHECK(false, "Unsupported backend type: ", backend);


wanchaol · 2020-12-03T21:45:41Z

test/distributed/test_jit_c10d.py

+                self.pg = pg_nccl
+
+                now = datetime.now(timezone.utc).timestamp()
+                name = "nccl_process_group_as_module_member_%d" % now


Why are we using a timestamp for the name? Is this because the test might be called multiple times and it would error if we reused the same name?

Yep, when the test runs multiple times, the names may conflict.
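
As a side note on the naming approach discussed here, the sketch below reproduces the timestamp pattern from the diff and shows one possible alternative. `timestamp_name` and `unique_name` are hypothetical helpers, not part of the PR. Because `%d` truncates the timestamp to whole seconds, two names generated within the same second would be identical; a UUID suffix avoids that, under the assumption that the name only needs to be unique and not human-meaningful.

```python
import uuid
from datetime import datetime, timezone


def timestamp_name(prefix):
    # Same pattern as the test in the diff: seconds since the epoch, truncated by %d.
    now = datetime.now(timezone.utc).timestamp()
    return "%s_%d" % (prefix, now)


def unique_name(prefix):
    # Alternative sketch: a random UUID suffix does not depend on the clock.
    return "%s_%s" % (prefix, uuid.uuid4().hex)
```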

wanchaol · 2020-12-03T22:15:48Z

torch/lib/c10d/frontend.cpp

+      error << ", ";
+    }
+    error << "}";
+    TORCH_CHECK(false, error.str());


ditto at_error

Done for all instances of TORCH_CHECK(false

wanchaol · 2020-12-03T22:16:00Z

torch/lib/c10d/frontend.cpp

+    error << name;
+    error << " , instead we have ";
+    error << pg_names_.size() << " process groups: {";
+    for (const auto& pg : pg_names_) {


It seems that if we switch environments and no pg has been created, the user needs to manually create a pg that matches the name they registered. Shall we provide instructions telling the user to do so?

Yep, we should. In fact, we should write this logic for our current users.
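
A rough Python model of the lookup and the kind of error message being discussed might look like the following. The function name mirrors `getProcessGroupByName` from the diff, but the list-of-pairs representation and the exact wording of the hint are assumptions made for illustration.

```python
def get_process_group_by_name(pg_names, name):
    """pg_names: list of (process_group, name) pairs, a stand-in for pg_names_."""
    for pg, pg_name in pg_names:
        if pg_name == name:
            return pg
    # Build an error that lists what is registered and tells the user what to do.
    known = ", ".join(pg_name for _, pg_name in pg_names)
    raise RuntimeError(
        "Unable to find process group named '%s'; instead we have %d process "
        "groups: {%s}. Create a process group and register it under this name "
        "before loading." % (name, len(pg_names), known)
    )
```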

wanchaol · 2020-12-03T22:16:57Z

torch/lib/c10d/frontend.cpp

+      [&](const std::pair<c10::intrusive_ptr<ProcessGroup>, std::string>&
+              pg_name) { return pg_name.second == name; });
+
+  TORCH_CHECK(it == pg_names_.end(), "Requested name already exists: ", name);


I think if the requested process group name already exists, we can just warn the user and use the existing pg when needed. That way the user won't need to create it again and again.

Changed it to error out only when the instance found is not the same as the process_group argument.
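
The resolution described in the reply can be sketched in Python as follows. This is a model of the stated behavior only, assuming a list of (process_group, name) pairs; it is not the C++ code from the PR.

```python
def register_process_group_name(pg_names, process_group, name):
    """Register `name` for `process_group` in a list of (pg, name) pairs."""
    for existing_pg, existing_name in pg_names:
        if existing_name == name:
            if existing_pg is process_group:
                # Same instance under the same name: treat as a no-op.
                return
            # A different instance already owns this name: that is an error.
            raise RuntimeError("Requested name already exists: " + name)
    pg_names.append((process_group, name))
```

Registering the same object twice under the same name is harmless in this sketch, while reusing a name for a different object still fails loudly.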

facebook-github-bot

@gmagogsfm has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

wanchaol

looks good to me, thanks! only one minor suggestion.

wanchaol · 2020-12-04T18:51:34Z

test/backward_compatibility/check_backward_compatibility.py

 dont_parse_list = [
    ("_TorchScriptTesting.*", datetime.date(2099, 9, 17)),
    ("test_backend", datetime.date(2099, 9, 17)),
+    ("c10d.frontend", datetime.date(2020, 12, 30)),


curious why we need this?

Previously, I made the mistake of putting frontend into c10d namespace rather than dist_c10d, which all of our other torchbinds are in. So I changed the namespace in this diff, resulting in a BC-breaking change.
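
For readers unfamiliar with the list in the diff, here is a hedged sketch of how a (pattern, expiry date) list like this could be consulted. The entries are copied from the diff; `should_skip` is a hypothetical helper, and the assumption that an entry stops applying once its date has passed is an illustration of the idea, not a description of the actual script.

```python
import datetime
import re

dont_parse_list = [
    ("_TorchScriptTesting.*", datetime.date(2099, 9, 17)),
    ("test_backend", datetime.date(2099, 9, 17)),
    ("c10d.frontend", datetime.date(2020, 12, 30)),
]


def should_skip(schema_name, today):
    # Assumption: an entry exempts matching schemas only until its expiry date.
    for pattern, expiry in dont_parse_list:
        if today < expiry and re.match(pattern, schema_name):
            return True
    return False
```

Under this reading, the short expiry on the c10d.frontend entry gives the namespace rename a temporary exemption instead of a permanent one.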

wanchaol · 2020-12-04T18:58:01Z

torch/csrc/distributed/c10d/init.cpp

+              auto base_process_group =
+                  ::c10d::DistributedC10d::get()->getProcessGroupByName(
+                      process_group_name);
+              TORCH_CHECK(


It seems we already check this inside getProcessGroupByName, so this check is likely unreachable code. Maybe merge the two checks together?

ghstack-source-id: 32b2bc5 Pull Request resolved: #48333

facebook-github-bot · 2020-12-05T03:18:38Z

@gmagogsfm merged this pull request in a3298c2.

facebook-github-bot added the cla signed, oncall: jit, and oncall: distributed labels Nov 28, 2020

gmagogsfm force-pushed the ci-all/ycao2 branch from 8a04f6c to 125b839 Compare November 29, 2020 01:49

gmagogsfm force-pushed the ci-all/ycao2 branch 25 times, most recently from 6b79608 to a6be99d Compare December 3, 2020 01:25

gmagogsfm marked this pull request as ready for review December 3, 2020 07:44

gmagogsfm requested review from mingzhe09088, mrshenli, pritamdamania87, rohan-varma and zhaojuanmao as code owners December 3, 2020 07:44

gmagogsfm requested a review from wanchaol December 3, 2020 07:45

wanchaol reviewed Dec 3, 2020

gmagogsfm force-pushed the ci-all/ycao2 branch from 2d14f7e to df0952c Compare December 4, 2020 01:39

facebook-github-bot reviewed Dec 4, 2020

gmagogsfm force-pushed the ci-all/ycao2 branch from df0952c to e71387a Compare December 4, 2020 02:02

facebook-github-bot reviewed Dec 4, 2020

gmagogsfm force-pushed the ci-all/ycao2 branch from e71387a to ea888ed Compare December 4, 2020 03:36

facebook-github-bot reviewed Dec 4, 2020

gmagogsfm force-pushed the ci-all/ycao2 branch from ea888ed to 085680e Compare December 4, 2020 04:15

facebook-github-bot reviewed Dec 4, 2020

gmagogsfm force-pushed the ci-all/ycao2 branch from 085680e to da1df70 Compare December 4, 2020 07:24

wanchaol approved these changes Dec 4, 2020

gmagogsfm force-pushed the ci-all/ycao2 branch from da1df70 to ec15787 Compare December 4, 2020 21:31

facebook-github-bot reviewed Dec 4, 2020

Implement JIT serialization of ProcessGroup

f6ae7f0

gmagogsfm force-pushed the ci-all/ycao2 branch from ec15787 to f6ae7f0 Compare December 4, 2020 22:12

facebook-github-bot reviewed Dec 4, 2020

facebook-github-bot closed this in a3298c2 Dec 5, 2020

facebook-github-bot added the Merged label Dec 5, 2020

facebook-github-bot deleted the ci-all/ycao2 branch January 27, 2021 18:26
