Move allgather_coalesced implementation from Python to C++ #29059
Conversation
This pull request was exported from Phabricator. Differential Revision: D18277097
#29059 caused a broken build due to an unimplemented function in the MPI backend. Fixed here.
Force-pushed from 36b882e to efd623b.
This pull request was exported from Phabricator. Differential Revision: D18277097
Force-pushed from efd623b to 0542a6c.
This pull request was exported from Phabricator. Differential Revision: D18277097
Force-pushed from 0542a6c to b21acc4.
I should have checked CI before approving #28857.
It's all green now, so it should be good to go.
Summary:
Pull Request resolved: pytorch#29059

This is a resubmit of reverted diff D18209289 (PR pytorch#28857).

Test Plan:
buck test caffe2/test:c10d
buck test caffe2/test:distributed_gloo

Reviewed By: pietern

Differential Revision: D18277097

fbshipit-source-id: 3e16c4c5f71e5c051ffef280e021bd253caf127c
Force-pushed from b21acc4 to 557c40b.
This pull request was exported from Phabricator. Differential Revision: D18277097
This pull request has been merged in 23695ab.
@@ -1309,31 +1309,37 @@ def test_allgather_stress_cuda(self):
     def test_allgather_coalesced_checks(self):
         store = c10d.FileStore(self.file_name, self.world_size)
         pg = c10d.ProcessGroupGloo(store, self.rank, self.world_size, self.opts())
-        dummy_input = [torch.Tensor([1])]
+        dummy_input = [torch.zeros([1], dtype=torch.float32)]
Sorry -- why are these not translated exactly? torch.Tensor([1]) is torch.ones([1]), not zeros, right? Also, same with the line below: why did that change from -1 to 0?
This only tests error handling, so the underlying values here should not be important (all_gather_coalesced never copies anything in this function). I am happy to change it back if you prefer.
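As a side note for readers of this thread, a quick Python illustration of the constructor semantics in question (this snippet is illustrative and not part of the diff):

    import torch

    # torch.Tensor([1]) treats the list as data, so it holds the value 1.0 --
    # equivalent to torch.ones(1), not a zero tensor.
    a = torch.Tensor([1])
    b = torch.ones(1)
    assert torch.equal(a, b)

    # The replacement line constructs an explicit zero tensor; since the test
    # only exercises argument validation, the stored value is never read.
    c = torch.zeros([1], dtype=torch.float32)
    assert c.item() == 0.0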
@@ -203,6 +203,20 @@ inline void assertCPU(
   }
 }

+inline void assertSameDevice(
+    std::function<void(const std::string&)> fn,
+    const at::ArrayRef<at::Tensor>& tensors) {
Don't we have a TensorList for this? (Also, I wouldn't expect a const reference to it; it's trivial to copy.)
Ah, a good point. I was trying to be consistent with the other functions in this module, which mostly take a const reference to ArrayRef rather than TensorList (TensorList = ArrayRef<Tensor>).
Actually, I just need to verify tensors in a vector, so I might just accept a const ref to a vector.
Would you prefer that?
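For illustration, here is a minimal sketch of the check with TensorList passed by value, as the reviewer suggests. Only the signature appears in the diff above; the body here is an assumption, not the PR's actual implementation:

    #include <functional>
    #include <string>

    #include <ATen/ATen.h>

    // Hypothetical body for illustration only; the diff shows just the signature.
    // at::TensorList is an alias for at::ArrayRef<at::Tensor>, a cheap
    // non-owning view, so passing it by value is idiomatic.
    inline void assertSameDevice(
        std::function<void(const std::string&)> fn,
        at::TensorList tensors) {
      if (tensors.empty()) {
        return;
      }
      auto device = tensors[0].device();
      for (const auto& tensor : tensors) {
        if (tensor.device() != device) {
          fn("tensors should be on the same device");
          return;
        }
      }
    }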
Summary:
Pull Request resolved: #29059
Resubmit of reverted PR #28857.
Differential Revision: D18277097
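For context on what this PR moves into C++, a hedged sketch of how the coalesced allgather is driven from Python; the all_gather_coalesced binding name and argument order are assumptions based on the torch.distributed API of this period, not something confirmed by the diff:

    import torch
    import torch.distributed as dist

    # Assumes a process group has already been initialized, e.g. with the
    # gloo backend, before this function runs on each rank.
    def run(rank, world_size):
        input_tensors = [torch.ones(2) * rank, torch.ones(3) * rank]
        output_lists = [
            [torch.zeros(2), torch.zeros(3)] for _ in range(world_size)
        ]
        dist.all_gather_coalesced(output_lists, input_tensors)
        # After the call, output_lists[r] holds rank r's input tensors,
        # gathered in one coalesced operation instead of one call per tensor.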