Switch PG::Work to Future in default_comm_hooks.cpp #59398
Conversation
[ghstack-poisoned]
💊 CI failures summary and remediations

As of commit ce80301 (more details on the Dr. CI page):

🕵️ 4 new failures recognized by patterns. The following CI failures do not appear to be due to upstream breakages:

| Job | Step | Action |
|---|---|---|
|  | Download PyTorch Test Reports | 🔁 rerun |

This comment was automatically generated by Dr. CI (expand for details). Follow this link to opt-out of these comments for your Pull Requests. Please report bugs/suggestions to the (internal) Dr. CI Users group.
@agolynski has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Thanks for the fix!
```diff
- auto allreduce_work = state_->allreduce(tensors);
+ auto allreduce_fut = state_->allreduce(tensors)->getFuture();

  // FIXME Access the result through the Future passed as argument, instead of
```
Please remove the FIXME comment.
Done
```diff
  // capturing the Work.
- auto div_by_process_group_size = [allreduce_work,
+ auto div_by_process_group_size = [allreduce_fut,
                                    this](c10::ivalue::Future& /* unused */) {
-   auto tensor = allreduce_work->result()[0] / state_->getSize();
+   auto result = allreduce_fut->value();
+   TORCH_INTERNAL_ASSERT(result.isTensorList(),
+       "ProcessGroup::allreduce should return TensorList");
+   auto tensor = result.toTensorVector()[0] / state_->getSize();
    return c10::IValue(tensor);
  };

- auto fut = allreduce_work->getFuture();
- return fut->then(div_by_process_group_size, fut->elementType());
+ return allreduce_fut->then(div_by_process_group_size, allreduce_fut->elementType());
```
This code introduces a (potential) memory leak: by adding a callback, you end up storing on `allreduce_fut` a lambda whose closure contains an owning pointer to `allreduce_fut` itself. This creates a reference cycle, which means that if for some reason the future is never completed it will never be "garbage collected", because its refcount will never reach 0.

This is precisely the problem that was being addressed in the FIXME that you removed: instead of capturing `allreduce_fut`, we should use the argument that is being passed to the lambda, which also points to `allreduce_fut` but doesn't cause the reference cycle.
In other words, please do this:

```cpp
auto div_by_process_group_size = [this](c10::ivalue::Future& allreduce_fut) {
  auto result = allreduce_fut.value();
  ...
};
return allreduce_fut->then(div_by_process_group_size, ...);
```
The same applies to the other FIXME just below.
To go the extra mile, it might also be good to avoid capturing `this` in the lambda, because it boils down to capturing a raw pointer, which means that there is no guarantee that the pointed-to object will still be alive once the callback fires.

The only reason for capturing `this` is to access `state_->getSize()` (I believe), hence what we could do instead is capture that value directly:

```cpp
auto div_by_process_group_size = [size{state_->getSize()}](c10::ivalue::Future& allreduce_fut) { ... };
```
> The only reason for capturing `this` is to access `state_->getSize()` (I believe), hence what we could do instead is capture that value directly:

thanks, capturing `this` here is not a good idea indeed; this can easily lead to a crash when the Future is not immediately returned (e.g. GLOO and MPI backends)
> This code introduces a (potential) memory leak … instead of capturing `allreduce_fut`, we should use the argument that is being passed to the lambda … The same applies to the other FIXME just below.
Done.
Talked offline: this is a defensive measure. If the Future is never completed, behavior is typically undefined for anything depending on its result, and a memory leak would be the least of our problems in that case.
Differential Revision: [D28876182](https://our.internmc.facebook.com/intern/diff/D28876182) [ghstack-poisoned]
@agolynski has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Looks good, thanks!
@agolynski has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
@agolynski merged this pull request in 1183fa3.
Summary: Pull Request resolved: pytorch#59398

Test Plan: Imported from OSS

Reviewed By: SciPioneer

Differential Revision: D28876182

Pulled By: agolynski

fbshipit-source-id: 9d8f09ffa2f40bb0fb25c626b52678a1597a797e
Stack from ghstack:
Differential Revision: D28876182