Move allgather_coalesced implementation from Python to C++ #29059

Closed
wants to merge 1 commit into from

@agolynski agolynski commented Nov 1, 2019

Summary:
Pull Request resolved: #29059

Resubmit of reverted PR #28857.

Differential Revision: D18277097

@facebook-github-bot

This pull request was exported from Phabricator. Differential Revision: D18277097

@agolynski

#28857 caused a broken build due to an unimplemented function in the MPI backend. Fixed here.


@pietern changed the title from "Moving python allgather_coalesced impl from Py to C. (#28857)" to "Moving allgather_coalesced implementation from Python to C++" on Nov 4, 2019
@pietern changed the title from "Moving allgather_coalesced implementation from Python to C++" to "Move allgather_coalesced implementation from Python to C++" on Nov 4, 2019

@pietern pietern left a comment

I should have checked CI before approving #28857.

It's all green now, so it should be good to go.

Summary:
Pull Request resolved: pytorch#29059
This is a resubmit of reverted diff D18209289 ( PR pytorch#28857 ).

Test Plan:
buck test caffe2/test:c10d
buck test caffe2/test:distributed_gloo

Reviewed By: pietern

Differential Revision: D18277097

fbshipit-source-id: 3e16c4c5f71e5c051ffef280e021bd253caf127c

@facebook-github-bot

This pull request has been merged in 23695ab.

@@ -1309,31 +1309,37 @@ def test_allgather_stress_cuda(self):
     def test_allgather_coalesced_checks(self):
         store = c10d.FileStore(self.file_name, self.world_size)
         pg = c10d.ProcessGroupGloo(store, self.rank, self.world_size, self.opts())
-        dummy_input = [torch.Tensor([1])]
+        dummy_input = [torch.zeros([1], dtype=torch.float32)]
Contributor

sorry -- why are these not translated exactly?

torch.Tensor([1]) is torch.ones([1]), not zeros, right?

also same with the line below, why did that change from -1 to 0?

Contributor Author

This only tests error handling, so the underlying values here should not be important (all_gather_coalesced never copies anything in this function). I am happy to change it back if you prefer.
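
To illustrate the point about error handling, here is a minimal pure-Python sketch of argument checks that look only at tensor metadata and never at the stored values. It is not the ProcessGroupGloo implementation: `FakeTensor`, `check_allgather_coalesced_args`, the error messages, and the exact set of checks are all assumptions made up for this example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class FakeTensor:
    """Hypothetical stand-in for a tensor: metadata plus a payload."""
    shape: Tuple[int, ...]
    dtype: str
    values: List[float]


def check_allgather_coalesced_args(output_lists, input_list, world_size):
    """Raise ValueError on inconsistent arguments.

    Only list lengths, dtypes, and shapes are inspected; `values` is
    never read, so the payload cannot affect the outcome.
    """
    if not input_list:
        raise ValueError("requires non-empty input tensor list")
    if len(output_lists) != world_size:
        raise ValueError("output lists should be equal to world size")
    for output_list in output_lists:
        if len(output_list) != len(input_list):
            raise ValueError("output list size != input list size")
        for i, (out, inp) in enumerate(zip(output_list, input_list)):
            if out.dtype != inp.dtype:
                raise ValueError(f"invalid tensor type at index {i}")
            if out.shape != inp.shape:
                raise ValueError(f"invalid size at index {i}")


zeros = [FakeTensor((1,), "float32", [0.0])]
ones = [FakeTensor((1,), "float32", [1.0])]
world_size = 2

# Same metadata, different payloads: both calls pass the checks.
check_allgather_coalesced_args([zeros, zeros], zeros, world_size)
check_allgather_coalesced_args([ones, ones], ones, world_size)
```

Under these assumptions, swapping a ones-valued dummy input for a zeros-valued one leaves every check unchanged, which is the argument made in the reply above.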

@@ -203,6 +203,20 @@ inline void assertCPU(
   }
 }
 
+inline void assertSameDevice(
+    std::function<void(const std::string&)> fn,
+    const at::ArrayRef<at::Tensor>& tensors) {
Contributor

don't we have a TensorList for this? (Also I wouldn't expect const reference to it, it's trivial to copy).

Contributor Author

ah, a good point. I was trying to be consistent with the other functions in this module, which mostly take a const reference to ArrayRef rather than a TensorList (TensorList = ArrayRef<Tensor>).
Actually, I just need to verify tensors in a vector, so I might just accept a const ref to a vector.
Would you prefer that?
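
For readers who have not seen the rest of the helper, the kind of check being discussed can be sketched in Python as follows. This is only an illustration of the idea (a callback that receives an error message when the tensors do not share a device); `DeviceOnlyTensor`, `assert_same_device`, and the message text are hypothetical and are not taken from the C++ diff, whose body is not shown here.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class DeviceOnlyTensor:
    """Hypothetical stand-in that carries only a device name."""
    device: str


def assert_same_device(
    fn: Callable[[str], None],
    tensors: Sequence[DeviceOnlyTensor],
) -> None:
    """Call `fn` with a message if any tensor's device differs from the first."""
    if len(tensors) < 2:
        return  # zero or one tensor is trivially consistent
    expected = tensors[0].device
    for i, tensor in enumerate(tensors[1:], start=1):
        if tensor.device != expected:
            fn(
                f"tensors should be on the same device "
                f"(index {i} is on {tensor.device}, expected {expected})"
            )
            return  # report only the first mismatch


errors = []
assert_same_device(errors.append, [DeviceOnlyTensor("cpu"), DeviceOnlyTensor("cpu")])
assert_same_device(errors.append, [DeviceOnlyTensor("cpu"), DeviceOnlyTensor("cuda:0")])
```

Here the callback just collects messages in a list; a caller could equally pass a function that raises. Any sequence works as the `tensors` argument, which loosely mirrors the question above of whether the C++ helper should take an ArrayRef or a vector.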
