scatter_object_list API for c10d #43930

rohan-varma · 2020-09-01T02:02:01Z

Stack from ghstack:

scatter_object_list API for c10d #43930 scatter_object_list API for c10d

Closes #23232. As part of addressing #23232, this PR adds support for scatter_object_list which is an API to scatter arbitrary picklable objects to all the other ranks.

The implementation approach follows a similar approach as #42189. The result of the scatter is stored as the first element of scatter_object_output_list, and the src rank is expected to provide an input list scatter_object_input_list which contains the objects to scatter.

Note that this API requires 1 broadcast and 2 scatters. This is because we must communicate the maximum object size to be scattered, which only the src rank knows about. After that, we also need to communicate the objects themselves as well as the true sizes of the object.

Note that the API is designed to match the tensor-based collectives other than supporting async_op. For now, it is a blocking call. If we see demand to support async_op, we will have to make more progress on merging work/future to support this.

It only works for Gloo because NCCL doesn't support scatter.

Differential Revision: D23430686

NOTE FOR REVIEWERS: This PR has internal Facebook specific changes or comments, please review them on Phabricator!

Closes #23232. As part of addressing #23232, this PR adds support for scatter_object_list which is an API to scatter arbitrary picklable objects to all the other ranks. The implementation approach follows a similar approach as #42189. The result of the `scatter` is stored as the first element of `scatter_object_output_list`, and the src rank is expected to provide an input list `scatter_object_input_list` which contains the objects to scatter. Note that this API requires 1 broadcast and 2 scatters. This is because we must communicate the maximum object size to be scattered, which only the src rank knows about. After that, we also need to communicate the objects themselves as well as the true sizes of the object. Note that the API is designed to match the tensor-based collectives other than supporting async_op. For now, it is a blocking call. If we see demand to support async_op, we will have to make more progress on merging work/future to support this. It only works for Gloo because NCCL doesn't support scatter. Differential Revision: [D23430686](https://our.internmc.facebook.com/intern/diff/D23430686/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D23430686/)! [ghstack-poisoned]

dr-ci · 2020-09-01T02:12:15Z

💊 CI failures summary and remediations

As of commit 370522c (more details on the Dr. CI page):

3/3 failures introduced in this PR

🕵️ 3 new failures recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

pytorch_linux_xenial_py3_6_gcc5_4_build (1/3)

Step: "Build" (full log | diagnosis details | 🔁 rerun)

Dec 04 21:51:33 /var/lib/jenkins/workspace/caffe2/utils/eigen_utils.h:113:25: error: 'RowVectorX' in namespace 'Eigen' does not name a template type

Dec 04 21:51:33                         ^ 
Dec 04 21:51:33 /var/lib/jenkins/workspace/caffe2/utils/eigen_utils.h:113:25: error: 'RowVectorX' in namespace 'Eigen' does not name a template type 
Dec 04 21:51:33  using ERVecXU8 = Eigen::RowVectorX<uint8_t>; 
Dec 04 21:51:33                          ^ 
Dec 04 21:51:33 make[2]: *** [caffe2/CMakeFiles/torch_cpu.dir/utils/math/elementwise.cc.o] Error 1 
Dec 04 21:51:33 caffe2/CMakeFiles/torch_cpu.dir/build.make:9475: recipe for target 'caffe2/CMakeFiles/torch_cpu.dir/utils/math/elementwise.cc.o' failed 
Dec 04 21:51:33 In file included from /var/lib/jenkins/workspace/caffe2/utils/math/reduce.cc:18:0: 
Dec 04 21:51:33 /var/lib/jenkins/workspace/caffe2/utils/eigen_utils.h:110:24: error: 'RowVector' in namespace 'Eigen' does not name a template type 
Dec 04 21:51:33  using ERVecXt = Eigen::RowVector<T, Eigen::Dynamic>; 
Dec 04 21:51:33                         ^ 
Dec 04 21:51:33 /var/lib/jenkins/workspace/caffe2/utils/eigen_utils.h:113:25: error: 'RowVectorX' in namespace 'Eigen' does not name a template type 
Dec 04 21:51:33  using ERVecXU8 = Eigen::RowVectorX<uint8_t>; 
Dec 04 21:51:33                          ^ 
Dec 04 21:51:33 caffe2/CMakeFiles/torch_cpu.dir/build.make:9499: recipe for target 'caffe2/CMakeFiles/torch_cpu.dir/utils/math/reduce.cc.o' failed 
Dec 04 21:51:33 make[2]: *** [caffe2/CMakeFiles/torch_cpu.dir/utils/math/reduce.cc.o] Error 1 
Dec 04 21:51:33 CMakeFiles/Makefile2:8261: recipe for target 'caffe2/CMakeFiles/torch_cpu.dir/all' failed 
Dec 04 21:51:33 make[1]: *** [caffe2/CMakeFiles/torch_cpu.dir/all] Error 2 
Dec 04 21:51:33 make: *** [all] Error 2 
Dec 04 21:51:33 Makefile:138: recipe for target 'all' failed 
Dec 04 21:51:33 -- Building version 1.8.0a0 
THON=True -DBUILD_TEST=True -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/var/lib/jenkins/workspace/torch -DCMAKE_PREFIX_PATH=/opt/conda/lib/python3.6/site-packages -DNUMPY_INCLUDE_DIR=/opt/conda/lib/python3.6/site-packages/numpy/core/include -DPYTHON_EXECUTABLE=/opt/conda/bin/python -DPYTHON_INCLUDE_DIR=/opt/conda/include/python3.6m -DPYTHON_LIBRARY=/opt/conda/lib/libpython3.6m.so.1.0 -DTORCH_BUILD_VERSION=1.8.0a0 -DUSE_LLVM=/opt/llvm -DUSE_NUMPY=True -DWERROR=1 /var/lib/jenkins/workspace

pytorch_xla_linux_bionic_py3_6_clang9_build (2/3)

Step: "Build" (full log | diagnosis details | 🔁 rerun)

Dec 04 21:56:27 /var/lib/jenkins/workspace/caffe2/utils/eigen_utils.h:113:25: error: no template named 'RowVectorX' in namespace 'Eigen'

Dec 04 21:56:23 /var/lib/jenkins/workspace/caffe2/utils/eigen_utils.h:113:25: error: no template named 'RowVectorX' in namespace 'Eigen' 
Dec 04 21:56:23 using ERVecXU8 = Eigen::RowVectorX<uint8_t>; 
Dec 04 21:56:23                  ~~~~~~~^ 
Dec 04 21:56:23 2 errors generated. 
Dec 04 21:56:23 make[2]: *** [caffe2/CMakeFiles/torch_cpu.dir/utils/math/elementwise.cc.o] Error 1 
Dec 04 21:56:23 caffe2/CMakeFiles/torch_cpu.dir/build.make:5270: recipe for target 'caffe2/CMakeFiles/torch_cpu.dir/utils/math/elementwise.cc.o' failed 
Dec 04 21:56:27 In file included from /var/lib/jenkins/workspace/caffe2/utils/math/reduce.cc:18: 
Dec 04 21:56:27 /var/lib/jenkins/workspace/caffe2/utils/eigen_utils.h:110:24: error: no template named 'RowVector' in namespace 'Eigen' 
Dec 04 21:56:27 using ERVecXt = Eigen::RowVector<T, Eigen::Dynamic>; 
Dec 04 21:56:27                 ~~~~~~~^ 
Dec 04 21:56:27 /var/lib/jenkins/workspace/caffe2/utils/eigen_utils.h:113:25: error: no template named 'RowVectorX' in namespace 'Eigen' 
Dec 04 21:56:27 using ERVecXU8 = Eigen::RowVectorX<uint8_t>; 
Dec 04 21:56:27                  ~~~~~~~^ 
Dec 04 21:56:27 2 errors generated. 
Dec 04 21:56:27 make[2]: *** [caffe2/CMakeFiles/torch_cpu.dir/utils/math/reduce.cc.o] Error 1 
Dec 04 21:56:27 caffe2/CMakeFiles/torch_cpu.dir/build.make:5283: recipe for target 'caffe2/CMakeFiles/torch_cpu.dir/utils/math/reduce.cc.o' failed 
Dec 04 21:56:27 make[1]: *** [caffe2/CMakeFiles/torch_cpu.dir/all] Error 2 
Dec 04 21:56:27 CMakeFiles/Makefile2:7627: recipe for target 'caffe2/CMakeFiles/torch_cpu.dir/all' failed 
Dec 04 21:56:27 make: *** [all] Error 2 
Dec 04 21:56:27 Makefile:159: recipe for target 'all' failed 
Dec 04 21:56:27 -- Building version 1.8.0a0

pytorch_linux_xenial_cuda9_2_cudnn7_py3_gcc5_4_build (3/3)

Step: "Build" (full log | diagnosis details | 🔁 rerun)

Dec 04 21:54:40 /var/lib/jenkins/workspace/caffe2/utils/eigen_utils.h:113:25: error: 'RowVectorX' in namespace 'Eigen' does not name a template type

Dec 04 21:54:37                         ^ 
Dec 04 21:54:37 /var/lib/jenkins/workspace/caffe2/utils/eigen_utils.h:113:25: error: 'RowVectorX' in namespace 'Eigen' does not name a template type 
Dec 04 21:54:37  using ERVecXU8 = Eigen::RowVectorX<uint8_t>; 
Dec 04 21:54:37                          ^ 
Dec 04 21:54:37 make[2]: *** [caffe2/CMakeFiles/torch_cpu.dir/utils/math/elementwise.cc.o] Error 1 
Dec 04 21:54:37 caffe2/CMakeFiles/torch_cpu.dir/build.make:9475: recipe for target 'caffe2/CMakeFiles/torch_cpu.dir/utils/math/elementwise.cc.o' failed 
Dec 04 21:54:40 In file included from /var/lib/jenkins/workspace/caffe2/utils/math/reduce.cc:18:0: 
Dec 04 21:54:40 /var/lib/jenkins/workspace/caffe2/utils/eigen_utils.h:110:24: error: 'RowVector' in namespace 'Eigen' does not name a template type 
Dec 04 21:54:40  using ERVecXt = Eigen::RowVector<T, Eigen::Dynamic>; 
Dec 04 21:54:40                         ^ 
Dec 04 21:54:40 /var/lib/jenkins/workspace/caffe2/utils/eigen_utils.h:113:25: error: 'RowVectorX' in namespace 'Eigen' does not name a template type 
Dec 04 21:54:40  using ERVecXU8 = Eigen::RowVectorX<uint8_t>; 
Dec 04 21:54:40                          ^ 
Dec 04 21:54:40 make[2]: *** [caffe2/CMakeFiles/torch_cpu.dir/utils/math/reduce.cc.o] Error 1 
Dec 04 21:54:40 caffe2/CMakeFiles/torch_cpu.dir/build.make:9499: recipe for target 'caffe2/CMakeFiles/torch_cpu.dir/utils/math/reduce.cc.o' failed 
Dec 04 21:54:40 make[1]: *** [caffe2/CMakeFiles/torch_cpu.dir/all] Error 2 
Dec 04 21:54:40 CMakeFiles/Makefile2:9984: recipe for target 'caffe2/CMakeFiles/torch_cpu.dir/all' failed 
Dec 04 21:54:40 Makefile:138: recipe for target 'all' failed 
Dec 04 21:54:40 make: *** [all] Error 2 
Dec 04 21:54:40 -- Building version 1.8.0a0 
YPE=Release -DCMAKE_INSTALL_PREFIX=/var/lib/jenkins/workspace/torch -DCMAKE_PREFIX_PATH=/opt/conda/lib/python3.6/site-packages -DCUDA_NVCC_EXECUTABLE=/opt/cache/lib/nvcc -DNUMPY_INCLUDE_DIR=/opt/conda/lib/python3.6/site-packages/numpy/core/include -DPYTHON_EXECUTABLE=/opt/conda/bin/python -DPYTHON_INCLUDE_DIR=/opt/conda/include/python3.6m -DPYTHON_LIBRARY=/opt/conda/lib/libpython3.6m.so.1.0 -DTORCH_BUILD_VERSION=1.8.0a0 -DUSE_LLVM=/opt/llvm -DUSE_NUMPY=True -DWERROR=1 /var/lib/jenkins/workspace

This comment was automatically generated by Dr. CI (expand for details).

Follow this link to opt-out of these comments for your Pull Requests.

Please report bugs/suggestions on the GitHub issue tracker or post in the (internal) Dr. CI Users group.

See how this bot performed.

This comment has been revised 42 times.

Closes #23232. As part of addressing #23232, this PR adds support for scatter_object_list which is an API to scatter arbitrary picklable objects to all the other ranks. The implementation approach follows a similar approach as #42189. The result of the `scatter` is stored as the first element of `scatter_object_output_list`, and the src rank is expected to provide an input list `scatter_object_input_list` which contains the objects to scatter. Note that this API requires 1 broadcast and 2 scatters. This is because we must communicate the maximum object size to be scattered, which only the src rank knows about. After that, we also need to communicate the objects themselves as well as the true sizes of the object. Note that the API is designed to match the tensor-based collectives other than supporting async_op. For now, it is a blocking call. If we see demand to support async_op, we will have to make more progress on merging work/future to support this. It only works for Gloo because NCCL doesn't support scatter. Differential Revision: [D23430686](https://our.internmc.facebook.com/intern/diff/D23430686/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D23430686/)! [ghstack-poisoned]

rohan-varma · 2020-09-02T19:31:54Z

torch/distributed/distributed_c10d.py

+    )
+
+    # Deserialize back to object
+    scatter_object_output_list[0] = _tensor_to_object(output_tensor, obj_tensor_size)


Currently we only support only one object being scattered onto a rank, which is similar to the scatter tensor collective, but this can be extended such that a rank can get multiple objects as well.

Closes #23232. As part of addressing #23232, this PR adds support for scatter_object_list which is an API to scatter arbitrary picklable objects to all the other ranks. The implementation approach follows a similar approach as #42189. The result of the `scatter` is stored as the first element of `scatter_object_output_list`, and the src rank is expected to provide an input list `scatter_object_input_list` which contains the objects to scatter. Note that this API requires 1 broadcast and 2 scatters. This is because we must communicate the maximum object size to be scattered, which only the src rank knows about. After that, we also need to communicate the objects themselves as well as the true sizes of the object. Note that the API is designed to match the tensor-based collectives other than supporting async_op. For now, it is a blocking call. If we see demand to support async_op, we will have to make more progress on merging work/future to support this. It only works for Gloo because NCCL doesn't support scatter. Differential Revision: [D23430686](https://our.internmc.facebook.com/intern/diff/D23430686/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D23430686/)! [ghstack-poisoned]

rohan-varma · 2020-11-04T17:39:27Z

@mrshenli Just pinging this diff since it is the last remaining object-based collectives API that we would like to support.

Closes #23232. As part of addressing #23232, this PR adds support for scatter_object_list which is an API to scatter arbitrary picklable objects to all the other ranks. The implementation approach follows a similar approach as #42189. The result of the `scatter` is stored as the first element of `scatter_object_output_list`, and the src rank is expected to provide an input list `scatter_object_input_list` which contains the objects to scatter. Note that this API requires 1 broadcast and 2 scatters. This is because we must communicate the maximum object size to be scattered, which only the src rank knows about. After that, we also need to communicate the objects themselves as well as the true sizes of the object. Note that the API is designed to match the tensor-based collectives other than supporting async_op. For now, it is a blocking call. If we see demand to support async_op, we will have to make more progress on merging work/future to support this. It only works for Gloo because NCCL doesn't support scatter. Differential Revision: [D23430686](https://our.internmc.facebook.com/intern/diff/D23430686/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D23430686/)! [ghstack-poisoned]

mrshenli

LGTM! Left some minor comments

mrshenli · 2020-12-04T15:15:51Z

torch/distributed/distributed_c10d.py

+            element will store the object scattered to this rank.
+        scatter_object_input_list (List[Any]): List of input objects to scatter.
+            Each object must be picklable. Only objects on the ``src`` rank will
+             be scattered, and the argument can be ``None`` for non-src ranks.


mrshenli · 2020-12-04T15:18:15Z

torch/distributed/distributed_c10d.py

+        :func:`scatter_object_list` uses ``pickle`` module implicitly, which
+        is known to be insecure. It is possible to construct malicious pickle
+        data which will execute arbitrary code during unpickling. Only call this
+        function with data you trust.


shall we add an example section to the docstring?

please ignore this. I saw it is added in the next PR.

mrshenli · 2020-12-04T15:22:37Z

torch/distributed/distributed_c10d.py

+            "Expected argument scatter_object_output_list to be a list of size at least 1."
+        )
+
+    my_rank = get_rank()


do we need to pass in the group arg here?

Good catch, looks like I missed this in the other APIs as well. Will add them and tests that would have caught this.

Although, I guess the implementation is not buggy, since we pass in the group to all the underlying comm apis.

mrshenli · 2020-12-04T15:27:28Z

torch/distributed/distributed_c10d.py

+    )
+
+    # Scatter per-object sizes to trim tensors when deserializing back to object
+    scatter(


(not asking for revision, and I am not sure if this option will be faster or slower) another option is to pack this into the broadcast call so that we can save one comm op, but will need to comm more data.

Sounds good, we can run a benchmark and send a follow up PR if there is indeed a perf benefit.

mrshenli · 2020-12-04T15:28:37Z

torch/testing/_internal/distributed/distributed_test.py

+                if self.rank == src_rank
+                else [None for _ in collectives_object_test_list]
+            )
+            scatter_list = scatter_list[: int(os.environ["WORLD_SIZE"])]


can this be dist.get_world_size()?

Closes #23232. As part of addressing #23232, this PR adds support for scatter_object_list which is an API to scatter arbitrary picklable objects to all the other ranks. The implementation approach follows a similar approach as #42189. The result of the `scatter` is stored as the first element of `scatter_object_output_list`, and the src rank is expected to provide an input list `scatter_object_input_list` which contains the objects to scatter. Note that this API requires 1 broadcast and 2 scatters. This is because we must communicate the maximum object size to be scattered, which only the src rank knows about. After that, we also need to communicate the objects themselves as well as the true sizes of the object. Note that the API is designed to match the tensor-based collectives other than supporting async_op. For now, it is a blocking call. If we see demand to support async_op, we will have to make more progress on merging work/future to support this. It only works for Gloo because NCCL doesn't support scatter. Differential Revision: [D23430686](https://our.internmc.facebook.com/intern/diff/D23430686/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D23430686/)! [ghstack-poisoned]

Pull Request resolved: #43930 Closes #23232. As part of addressing #23232, this PR adds support for scatter_object_list which is an API to scatter arbitrary picklable objects to all the other ranks. The implementation approach follows a similar approach as #42189. The result of the `scatter` is stored as the first element of `scatter_object_output_list`, and the src rank is expected to provide an input list `scatter_object_input_list` which contains the objects to scatter. Note that this API requires 1 broadcast and 2 scatters. This is because we must communicate the maximum object size to be scattered, which only the src rank knows about. After that, we also need to communicate the objects themselves as well as the true sizes of the object. Note that the API is designed to match the tensor-based collectives other than supporting async_op. For now, it is a blocking call. If we see demand to support async_op, we will have to make more progress on merging work/future to support this. It only works for Gloo because NCCL doesn't support scatter. ghstack-source-id: 117904065 Differential Revision: [D23430686](https://our.internmc.facebook.com/intern/diff/D23430686/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D23430686/)!

rohan-varma · 2020-12-05T02:08:51Z

CI error is unrelated:

Dec 04 21:51:33 /var/lib/jenkins/workspace/caffe2/utils/eigen_utils.h:113:25: error: 'RowVectorX' in namespace 'Eigen' does not name a template type
Dec 04 21:51:33  using ERVecXU8 = Eigen::RowVectorX<uint8_t>;

facebook-github-bot · 2020-12-05T03:18:38Z

This pull request has been merged in 02d89f9.

rohan-varma requested review from apaszke, mrshenli, pietern, pritamdamania87 and zhaojuanmao as code owners September 1, 2020 02:02

rohan-varma mentioned this pull request Sep 1, 2020

broadcast_object API for c10d #43887

Closed

rohan-varma mentioned this pull request Sep 1, 2020

[Docs] Add examples for new object-based c10d APIs #43932

Closed

rohan-varma added 4 commits August 31, 2020 19:28

rohan-varma commented Sep 2, 2020

View reviewed changes

facebook-github-bot added the cla signed label Oct 30, 2020

rohan-varma requested a review from mingzhe09088 as a code owner December 2, 2020 08:26

mrshenli approved these changes Dec 4, 2020

View reviewed changes

facebook-github-bot closed this in 02d89f9 Dec 5, 2020

facebook-github-bot added the Merged label Dec 5, 2020

facebook-github-bot deleted the gh/rohan-varma/161/head branch December 8, 2020 15:23

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

scatter_object_list API for c10d #43930

scatter_object_list API for c10d #43930

rohan-varma commented Sep 1, 2020 •

edited

dr-ci bot commented Sep 1, 2020 •

edited

rohan-varma Sep 2, 2020

rohan-varma commented Nov 4, 2020

mrshenli left a comment

mrshenli Dec 4, 2020

mrshenli Dec 4, 2020

mrshenli Dec 4, 2020

mrshenli Dec 4, 2020

rohan-varma Dec 4, 2020

rohan-varma Dec 4, 2020

mrshenli Dec 4, 2020

rohan-varma Dec 4, 2020

mrshenli Dec 4, 2020

rohan-varma commented Dec 5, 2020

facebook-github-bot commented Dec 5, 2020

scatter_object_list API for c10d #43930

scatter_object_list API for c10d #43930

Conversation

rohan-varma commented Sep 1, 2020 • edited

dr-ci bot commented Sep 1, 2020 • edited

💊 CI failures summary and remediations

🕵️ 3 new failures recognized by patterns

pytorch_linux_xenial_py3_6_gcc5_4_build (1/3)

pytorch_xla_linux_bionic_py3_6_clang9_build (2/3)

pytorch_linux_xenial_cuda9_2_cudnn7_py3_gcc5_4_build (3/3)

Choose a reason for hiding this comment

rohan-varma commented Nov 4, 2020

mrshenli left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

rohan-varma commented Dec 5, 2020

facebook-github-bot commented Dec 5, 2020

rohan-varma commented Sep 1, 2020 •

edited

dr-ci bot commented Sep 1, 2020 •

edited