Add all_gather_out and reduce_scatter_out #3359
Conversation
    double scale, int64_t scatter_dim, int64_t shard_count,
    std::vector<std::vector<int64_t>> groups);

static ir::Value reduce_scatter_out(XLATensor& output, const XLATensor& input,
Do you need to follow the Google C++ style guide here? https://google.github.io/styleguide/cppguide.html#Inputs_and_Outputs requires output parameters to be placed at the end of the parameter list.
Thanks for pointing it out! I was following the existing pattern in this file, which puts the out tensor as the first argument (a sketch contrasting the two orderings follows below):
https://github.com/pytorch/xla/blob/master/torch_xla/csrc/tensor_methods.cpp#L907
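For illustration only, a minimal sketch contrasting the style-guide ordering with the ordering this file already uses. The Tensor type and the function names here are hypothetical stand-ins, not the real XLA declarations:

#include <cstdint>
#include <vector>

struct Tensor {};  // hypothetical stand-in for XLATensor

// Google C++ style: pure inputs first, output parameters last (as pointers).
void ReduceScatterStyleGuide(const Tensor& input, double scale,
                             int64_t scatter_dim, int64_t shard_count,
                             const std::vector<std::vector<int64_t>>& groups,
                             Tensor* output);

// Pattern already used in tensor_methods.cpp: the out tensor leads the
// argument list, mirroring the existing *_out overloads in this file.
void ReduceScatterExistingPattern(Tensor& output, const Tensor& input,
                                  double scale, int64_t scatter_dim,
                                  int64_t shard_count,
                                  const std::vector<std::vector<int64_t>>& groups);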
Force-pushed from 19f5e1d to 65756ff.
@hjm-aws I will merge this PR once all tests pass.
Thanks @JackCaoG!
LGTM. Some nits below.
    const std::shared_ptr<ir::Value>& token, double scale, int64_t scatter_dim,
    int64_t shard_count,
    const std::vector<std::vector<int64_t>>& replica_groups) {
  XLATensor out = bridge::GetXlaTensor(output);
nit: is there a reason we don't do this conversion inside the reduce_scatter_out call (like we do for input)?
reduce_scatter_out takes out as a non-const reference (instead of a const reference), so the result of bridge::GetXlaTensor(output) has to be assigned to a named variable before the call: a temporary cannot bind to a non-const lvalue reference (see the sketch below).
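A minimal, self-contained sketch of the binding rule in play here. All names are hypothetical: Handle stands in for XLATensor, MakeHandle for bridge::GetXlaTensor, and TakesNonConstRef for reduce_scatter_out.

#include <memory>

struct Data {};

// Hypothetical handle type standing in for XLATensor.
struct Handle {
  std::shared_ptr<Data> data = std::make_shared<Data>();
};

Handle MakeHandle() { return Handle{}; }  // stands in for bridge::GetXlaTensor
void TakesConstRef(const Handle&) {}
void TakesNonConstRef(Handle&) {}         // stands in for reduce_scatter_out

int main() {
  TakesConstRef(MakeHandle());       // OK: a temporary binds to a const reference.
  // TakesNonConstRef(MakeHandle()); // error: a temporary (rvalue) cannot bind
                                     // to a non-const lvalue reference.
  Handle out = MakeHandle();         // so the handle gets a name first,
  TakesNonConstRef(out);             // then is passed by non-const reference.
}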
I came here for inspiration on how to set the ir::Value for an output tensor. Here I agree with Milad: I don't believe we need to keep the out variable outside of the reduce_scatter_out call. Leaving out outside doesn't accomplish anything; out will be destructed immediately after the return statement anyway.
XLATensor is a holder for a shared_ptr<Data>, so it can be used as a temporary. XLATensor::SetIrValue (and any other non-const method on XLATensor) manipulates the shared_ptr<Data> member inside the XLATensor, so passing bridge::GetXlaTensor(output) directly into a function that calls non-const methods on it is fine (see the sketch below).
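A hypothetical sketch of that handle-over-shared-state pattern; it is not the real XLATensor, whose Data of course carries far more than a single field.

#include <cassert>
#include <memory>

struct Data {
  int ir_value = 0;
};

class TensorHandle {
 public:
  TensorHandle() : data_(std::make_shared<Data>()) {}
  // Non-const method: mutates the shared Data, not the handle object itself.
  void SetIrValue(int v) { data_->ir_value = v; }
  int ir_value() const { return data_->ir_value; }

 private:
  std::shared_ptr<Data> data_;
};

int main() {
  TensorHandle original;
  TensorHandle copy = original;        // copies alias the same Data
  copy.SetIrValue(42);                 // mutating through a copy (or a temporary)...
  assert(original.ir_value() == 42);   // ...is visible through the original handle.
}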
NVM, I found the compiler doesn't like it :D
    at::Tensor& output, const at::Tensor& input,
    const std::shared_ptr<ir::Value>& token, int64_t dim, int64_t shard_count,
    const std::vector<std::vector<int64_t>>& replica_groups) {
  XLATensor out = bridge::GetXlaTensor(output);
ditto
This is to enable us to support the torch.distributed API.
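As context for why out variants help here: torch.distributed collectives such as all_gather and reduce_scatter fill caller-provided output tensors in place, which is the calling convention the new *_out methods mirror. Below is a hedged sketch contrasting a functional collective with an out variant; the Tensor and Token types and the exact parameter lists are illustrative placeholders, and the real declarations are in the diff above.

#include <cstdint>
#include <utility>
#include <vector>

struct Tensor {};  // hypothetical stand-in for XLATensor
struct Token {};   // hypothetical stand-in for ir::Value

// Functional form: returns a freshly created result tensor plus a token.
std::pair<Tensor, Token> all_gather(const Tensor& input, int64_t dim,
                                    int64_t shard_count,
                                    const std::vector<std::vector<int64_t>>& groups);

// Out variant: writes the result into a caller-provided tensor, matching the
// in-place semantics of torch.distributed collectives, and returns only the token.
Token all_gather_out(Tensor& output, const Tensor& input, const Token& token,
                     int64_t dim, int64_t shard_count,
                     const std::vector<std::vector<int64_t>>& groups);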