Skip to content

Conversation

gmagogsfm
Copy link
Contributor

@gmagogsfm gmagogsfm commented Nov 28, 2020

This diff enables JIT serialization of ProcessGroup, including both base ProcessGroup class and derived classes like ProcessGroupNCCL.

If a ProcessGroup is created via high-level APIs like dist_c10d.frontend().new_process_group_helper(), they are automatically serializable. If a ProcessGroup is created via its derived class TorchBind APIs like dist_c10d.ProcessGroupNCCL(), then it has to be given a name and registered with dist_c10d.frontend().register_process_group_name to be uniquely identifiable and serializable.

  • Fixed a minor bug in new dist_c10d frontend which fails to check whether a process group is used or not
  • Fixed an issue where test_jit_c10d.py wasn't really run due to a configuration bug. Now tests are run as a slow test (need ci-all/* branch)

@facebook-github-bot facebook-github-bot added cla signed oncall: jit Add this issue/PR to JIT oncall triage queue oncall: distributed Add this issue/PR to distributed oncall triage queue labels Nov 28, 2020
@dr-ci
Copy link

dr-ci bot commented Nov 29, 2020

💊 CI failures summary and remediations

As of commit f6ae7f0 (more details on the Dr. CI page):


  • 1/5 failures possibly* introduced in this PR
    • 1/1 non-CircleCI failure(s)
  • 4/5 broken upstream at merge base 4eb4db7 since Dec 04

3 jobs timed out:

  • pytorch_windows_vs2019_py36_cpu_test2
  • pytorch_windows_vs2019_py36_cuda10.1_test2
  • pytorch_windows_vs2019_py36_cuda11.1_test2

🚧 1 ongoing upstream failure:

These were probably caused by upstream breakages that are not fixed yet:


ci.pytorch.org: 1 failed


This comment was automatically generated by Dr. CI (expand for details).Follow this link to opt-out of these comments for your Pull Requests.

Please report bugs/suggestions on the GitHub issue tracker or post in the (internal) Dr. CI Users group.

See how this bot performed.

This comment has been revised 255 times.

@gmagogsfm gmagogsfm force-pushed the ci-all/ycao2 branch 25 times, most recently from 6b79608 to a6be99d Compare December 3, 2020 01:25
@gmagogsfm gmagogsfm marked this pull request as ready for review December 3, 2020 07:44
@gmagogsfm gmagogsfm requested a review from wanchaol December 3, 2020 07:45
prefix_store, rank, world_size, options);
#endif
#else
TORCH_CHECK(false, "Attempting to create GLOO-based process group while GLOO is either not enabled or built")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: AT_ERROR instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

} else {
// TODO: discuss to figure out how to extend this to third party backends?
return pg;
TORCH_CHECK(false, "Unsupported backend type: ", backend);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dittos

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

self.pg = pg_nccl

now = datetime.now(timezone.utc).timestamp()
name = "nccl_process_group_as_module_member_%d" % now
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why we are using timestamp for the name? is this because there might be multiple calls to the test and it will error if we use the same name?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, when we run test multiple times, they may conflict.

error << ", ";
}
error << "}";
TORCH_CHECK(false, error.str());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto at_error

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done for all of TORCH_CHECK(false

error << name;
error << " , instead we have ";
error << pg_names_.size() << " process groups: {";
for (const auto& pg : pg_names_) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it seems like if we are switch envs, and there's no pg created, user need to manually create pg that matches the name that they registered, shall we provide instructions to tell the user to do so?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep we should, in fact we should write this logic for our current user.

[&](const std::pair<c10::intrusive_ptr<ProcessGroup>, std::string>&
pg_name) { return pg_name.second == name; });

TORCH_CHECK(it == pg_names_.end(), "Requested name already exists: ", name);
Copy link
Collaborator

@wanchaol wanchaol Dec 3, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if the requested processGroupName already exists, we can just warn the user, and use that existed pg when needed? that way user won't need to create it again and again

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed it to only error out when found instance is not same as argument process_group

Copy link
Contributor

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gmagogsfm has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Copy link
Contributor

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gmagogsfm has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Copy link
Contributor

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gmagogsfm has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Copy link
Contributor

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gmagogsfm has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Copy link
Contributor

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gmagogsfm has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Copy link
Collaborator

@wanchaol wanchaol left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good to me, thanks! only one minor suggestion.

dont_parse_list = [
("_TorchScriptTesting.*", datetime.date(2099, 9, 17)),
("test_backend", datetime.date(2099, 9, 17)),
("c10d.frontend", datetime.date(2020, 12, 30)),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

curious why we need this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previously, I made the mistake of putting frontend into c10d namespace rather than dist_c10d, which all of our other torchbinds are in. So I changed the namespace in this diff, resulting in a BC-breaking change.

auto base_process_group =
::c10d::DistributedC10d::get()->getProcessGroupByName(
process_group_name);
TORCH_CHECK(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems like we already check this inside getProcessGroupByName, so this check is likely a code that can't be reached, maybe merge this two check together?

Copy link
Contributor

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gmagogsfm has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

ghstack-source-id: 32b2bc5
Pull Request resolved: #48333
Copy link
Contributor

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gmagogsfm has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@gmagogsfm merged this pull request in a3298c2.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cla signed Merged oncall: distributed Add this issue/PR to distributed oncall triage queue oncall: jit Add this issue/PR to JIT oncall triage queue

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants