Implement JIT serialization of ProcessGroup #48544
8a04f6c to 125b839
💊 CI failures summary and remediations, as of commit f6ae7f0 (more details on the Dr. CI page):
3 jobs timed out.
🚧 1 ongoing upstream failure, probably caused by an upstream breakage that is not fixed yet: ci.pytorch.org: 1 failed.
This comment was automatically generated by Dr. CI and has been revised 255 times.
6b79608 to a6be99d
torch/lib/c10d/frontend.cpp
Outdated
prefix_store, rank, world_size, options);
#endif
#else
TORCH_CHECK(false, "Attempting to create GLOO-based process group while GLOO is either not enabled or built")
nit: AT_ERROR instead?
Done.
torch/lib/c10d/frontend.cpp
Outdated
} else {
// TODO: discuss to figure out how to extend this to third party backends?
return pg;
TORCH_CHECK(false, "Unsupported backend type: ", backend);
ditto
Done.
test/distributed/test_jit_c10d.py
Outdated
self.pg = pg_nccl

now = datetime.now(timezone.utc).timestamp()
name = "nccl_process_group_as_module_member_%d" % now
Why are we using a timestamp for the name? Is it because the test may be called multiple times and would error out if the same name were reused?
Yep, when we run the test multiple times, the names may conflict.
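A minimal sketch of the naming scheme used in the test snippet above: `timestamp()` returns the seconds since the epoch as a float, and `%d` truncates that to whole seconds. The helper name `unique_pg_name` is hypothetical and not part of this PR; it only wraps the two lines from the test.

```python
from datetime import datetime, timezone


def unique_pg_name(prefix: str) -> str:
    """Hypothetical helper: append the current UTC epoch seconds to a prefix."""
    # timestamp() yields float seconds since the epoch; %d truncates to an int.
    now = datetime.now(timezone.utc).timestamp()
    return "%s_%d" % (prefix, now)


name = unique_pg_name("nccl_process_group_as_module_member")
print(name)
```

Names generated within the same second are identical, so this separates distinct test runs but not repeated calls made in the same second.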
torch/lib/c10d/frontend.cpp
Outdated
error << ", ";
}
error << "}";
TORCH_CHECK(false, error.str());
ditto: AT_ERROR
Done for all TORCH_CHECK(false, ...) calls.
error << name;
error << " , instead we have ";
error << pg_names_.size() << " process groups: {";
for (const auto& pg : pg_names_) {
It seems that if we switch environments and no process group has been created there, the user needs to manually create a process group that matches the name they registered. Shall we provide instructions telling the user to do so?
Yep, we should. In fact, we should write this logic for our current users.
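To make the suggestion above concrete, here is a plain-Python sketch of a name lookup whose error lists the registered names and tells the user what to do. It is an illustration only, not the C++ code in frontend.cpp: the function name, the dict-based `registry`, and the message wording are all assumptions.

```python
from typing import Dict


def get_process_group_by_name(registry: Dict[str, object], name: str) -> object:
    """Hypothetical stand-in for a name lookup; `registry` maps name -> group."""
    if name in registry:
        return registry[name]
    known = ", ".join(sorted(registry))
    # Assumed wording: report what exists and how to recover.
    raise LookupError(
        f"Unable to find a process group named {name!r}; instead we have "
        f"{len(registry)} process groups: {{{known}}}. "
        "Create a process group in this environment and register it under "
        "that name before loading."
    )


try:
    get_process_group_by_name({"pg_a": object()}, "pg_b")
except LookupError as err:
    message = str(err)
    print(message)
```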
torch/lib/c10d/frontend.cpp
Outdated
[&](const std::pair<c10::intrusive_ptr<ProcessGroup>, std::string>&
pg_name) { return pg_name.second == name; });

TORCH_CHECK(it == pg_names_.end(), "Requested name already exists: ", name);
I think if the requested processGroupName already exists, we can just warn the user and reuse the existing pg when needed. That way the user won't need to create it again and again.
Changed it to error out only when the found instance is not the same as the process_group argument.
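The behavior described in this reply can be sketched in plain Python as follows. This is a hedged illustration rather than the actual C++ change: the dict-based `registry` and the error text are assumptions, and the function only borrows the name `register_process_group_name` from the PR description.

```python
from typing import Dict


def register_process_group_name(
    registry: Dict[str, object], process_group: object, name: str
) -> None:
    """Hypothetical stand-in: register `process_group` under `name`."""
    if name not in registry:
        registry[name] = process_group
        return
    # Re-registering the same instance under the same name is a no-op.
    if registry[name] is not process_group:
        raise ValueError(
            f"Requested name already exists: {name} "
            "and it is associated with a different process group"
        )


registry: Dict[str, object] = {}
pg = object()
register_process_group_name(registry, pg, "my_pg")
register_process_group_name(registry, pg, "my_pg")  # same instance: accepted

try:
    register_process_group_name(registry, object(), "my_pg")
    conflict_rejected = False
except ValueError:
    conflict_rejected = True
```

Comparing by identity (`is not`) rather than equality mirrors the "same instance" wording in the reply.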
2d14f7e to df0952c
@gmagogsfm has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
df0952c to e71387a
e71387a to ea888ed
ea888ed to 085680e
085680e to da1df70
Looks good to me, thanks! Only one minor suggestion.
dont_parse_list = [
("_TorchScriptTesting.*", datetime.date(2099, 9, 17)),
("test_backend", datetime.date(2099, 9, 17)),
("c10d.frontend", datetime.date(2020, 12, 30)),
Curious: why do we need this?
Previously, I made the mistake of putting frontend into the c10d namespace rather than dist_c10d, which all of our other TorchBind bindings are in. So I changed the namespace in this diff, resulting in a BC-breaking change.
auto base_process_group =
::c10d::DistributedC10d::get()->getProcessGroupByName(
process_group_name);
TORCH_CHECK(
Seems like we already check this inside getProcessGroupByName, so this check is likely unreachable code. Maybe merge the two checks together?
da1df70 to ec15787
ec15787 to f6ae7f0
@gmagogsfm merged this pull request in a3298c2.
This diff enables JIT serialization of ProcessGroup, including both the base ProcessGroup class and derived classes like ProcessGroupNCCL.
If a ProcessGroup is created via high-level APIs like dist_c10d.frontend().new_process_group_helper(), it is automatically serializable. If a ProcessGroup is created via its derived class TorchBind APIs like dist_c10d.ProcessGroupNCCL(), then it has to be given a name and registered with dist_c10d.frontend().register_process_group_name to be uniquely identifiable and serializable.
test_jit_c10d.py wasn't really run due to a configuration bug. Now the tests are run as slow tests (need a ci-all/* branch).
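One way to picture why a registered name makes a ProcessGroup serializable: the saved artifact carries only the name, and loading resolves that name against whatever is registered in the loading process. The sketch below is a plain-Python analogy under that assumption, not the TorchBind implementation; `FakeProcessGroup`, `REGISTRY`, `save`, and `load` are hypothetical names.

```python
import json
from typing import Dict


class FakeProcessGroup:
    """Hypothetical stand-in for a process group; holds no real communicator."""


# Assumed per-process registry mapping a unique name to a live object.
REGISTRY: Dict[str, FakeProcessGroup] = {}


def save(name: str) -> str:
    # Only the registered name is written out, never the live object.
    if name not in REGISTRY:
        raise LookupError(f"{name!r} is not registered")
    return json.dumps({"process_group_name": name})


def load(payload: str) -> FakeProcessGroup:
    # The name is resolved against the registry of the loading process.
    name = json.loads(payload)["process_group_name"]
    return REGISTRY[name]


pg = FakeProcessGroup()
REGISTRY["pg_for_module"] = pg
payload = save("pg_for_module")
restored = load(payload)
```

Under this model, loading in a different environment only works if a process group has already been registered there under the same name, which matches the concern raised in the review thread.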