
Fix object-based collectives API to use torch.cuda.current_device instead of rank #46897

Closed
rohan-varma wants to merge 4 commits into gh/rohan-varma/190/base

Conversation

rohan-varma
Member

@rohan-varma rohan-varma commented Oct 27, 2020

Stack from ghstack:

These APIs implicitly assumed that the GPU for a rank is the same as the rank index, but
that is not necessarily true. For example, the first GPU could be reserved for a
different purpose, so rank 0 could use GPU 1, rank 1 could use GPU 2, and so on. Thus, we
require the user to specify the device to use via torch.cuda.set_device()
before making calls to these APIs. This expectation is reasonable since we
document it clearly, and we already expect the user to set the device for
DistributedDataParallel. Backwards compatibility is not a concern since these APIs have not been publicly announced yet.

Also adds/tidies up some documentation.

Differential Revision: [D24556177](https://our.internmc.facebook.com/intern/diff/D24556177/)
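
For context, a minimal usage sketch of the documented expectation (a hypothetical script, not part of this PR; it assumes the NCCL backend and a launcher that sets RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT, and LOCAL_RANK):

```python
import os

import torch
import torch.distributed as dist


def main():
    # Assumes a launcher has exported RANK, WORLD_SIZE, MASTER_ADDR,
    # MASTER_PORT, and LOCAL_RANK for this process.
    local_rank = int(os.environ["LOCAL_RANK"])
    dist.init_process_group(backend="nccl")

    # The expectation documented by this PR: pick this rank's GPU explicitly
    # instead of relying on "GPU index == rank".
    torch.cuda.set_device(local_rank)

    gathered = [None] * dist.get_world_size()
    dist.all_gather_object(gathered, {"rank": dist.get_rank()})
    print(gathered)


if __name__ == "__main__":
    main()
```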

Fix object-based collectives API to use torch.cuda.current_device instead of rank

[ghstack-poisoned]
@facebook-github-bot facebook-github-bot added the oncall: distributed label Oct 27, 2020
Update on "Fix object-based collectives API to use torch.cuda.current_device instead of rank"

[ghstack-poisoned]
rohan-varma added a commit that referenced this pull request Oct 27, 2020
Fix object-based collectives API to use torch.cuda.current_device instead of rank

Pull Request resolved: #46897
ghstack-source-id: 115223259

Differential Revision: [D24556177](https://our.internmc.facebook.com/intern/diff/D24556177/)
@dr-ci

dr-ci bot commented Oct 27, 2020

💊 CI failures summary and remediations

As of commit b062f49 (more details on the Dr. CI page):


  • 3/3 failures possibly* introduced in this PR
    • 1/3 non-CircleCI failure(s)

🕵️ 1 new failure recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_linux_backward_compatibility_check_test (1/1)

Step: "Run tests"

Oct 28 15:41:11 The PR is introducing backward incompatible changes to the operator library. Please contact PyTorch team to confirm whether this change is wanted or not.
Oct 28 15:41:11 processing existing schema:  __setstate__(__torch__.torch.classes.quantized.EmbeddingPackedParamsBase _0, (int, Tensor[], float[], int[]) _1) -> (None _0) 
Oct 28 15:41:11 processing existing schema:  bit_rate(__torch__.torch.classes.quantized.EmbeddingPackedParamsBase _0) -> (int _0) 
Oct 28 15:41:11 processing existing schema:  version(__torch__.torch.classes.quantized.EmbeddingPackedParamsBase _0) -> (int _0) 
Oct 28 15:41:11 processing existing schema:  __getstate__(__torch__.torch.classes.xnnpack.LinearOpContext _0) -> ((Tensor, Tensor?, Scalar?, Scalar?) _0) 
Oct 28 15:41:11 processing existing schema:  __setstate__(__torch__.torch.classes.xnnpack.LinearOpContext _0, (Tensor, Tensor?, Scalar?, Scalar?) _1) -> (None _0) 
Oct 28 15:41:11 processing existing schema:  __getstate__(__torch__.torch.classes.xnnpack.Conv2dOpContext _0) -> ((Tensor, Tensor?, int[], int[], int[], int, Scalar?, Scalar?) _0) 
Oct 28 15:41:11 processing existing schema:  __setstate__(__torch__.torch.classes.xnnpack.Conv2dOpContext _0, (Tensor, Tensor?, int[], int[], int[], int, Scalar?, Scalar?) _1) -> (None _0) 
Oct 28 15:41:11 processing existing schema:  __getstate__(__torch__.torch.classes.xnnpack.TransposeConv2dOpContext _0) -> ((Tensor, Tensor?, int[], int[], int[], int[], int, Scalar?, Scalar?) _0) 
Oct 28 15:41:11 processing existing schema:  __setstate__(__torch__.torch.classes.xnnpack.TransposeConv2dOpContext _0, (Tensor, Tensor?, int[], int[], int[], int[], int, Scalar?, Scalar?) _1) -> (None _0) 
Oct 28 15:41:11 processing existing schema:  __init__(__torch__.torch.classes.dist_rpc.WorkerInfo _0, str _1, int _2) -> (None _0) 
Oct 28 15:41:11 The PR is introducing backward incompatible changes to the operator library. Please contact PyTorch team to confirm whether this change is wanted or not.  
Oct 28 15:41:11  
Oct 28 15:41:11 Broken ops: [ 
Oct 28 15:41:11 	aten::_foreach_zero_(Tensor[] self) -> () 
Oct 28 15:41:11 ] 
Oct 28 15:41:11 + cleanup 
Oct 28 15:41:11 + retcode=1 
Oct 28 15:41:11 + set +x 
Oct 28 15:41:11 =================== sccache compilation log =================== 
Oct 28 15:41:11 =========== If your build fails, please take a look at the log above for possible reasons =========== 
Oct 28 15:41:11 Compile requests                      0 

1 failure not recognized by patterns:

Job: CircleCI pytorch_linux_xenial_cuda10_2_cudnn7_py3_gcc7_test1
Step: Report results

ci.pytorch.org: 1 failed



Update on "Fix object-based collectives API to use torch.cuda.current_device instead of rank"

[ghstack-poisoned]
rohan-varma added a commit that referenced this pull request Oct 27, 2020
Fix object-based collectives API to use torch.cuda.current_device instead of rank

Pull Request resolved: #46897
ghstack-source-id: 115266274

Differential Revision: [D24556177](https://our.internmc.facebook.com/intern/diff/D24556177/)
Contributor

@mrshenli mrshenli left a comment


LGTM! Added some minor comments.

# true.
current_device = torch.cuda.current_device()
input_tensor = input_tensor.to(current_device)
local_size = local_size.to(current_device)
# Gather all local sizes. This is so that we can find the max size, and index
# until the correct size when deserializing the tensors.
group_size = get_world_size(group=group)
object_sizes_tensor = torch.zeros(group_size, dtype=int).to(
Contributor

nit: we can avoid a copy by directly creating the tensor on the desired device?
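
For illustration, the two allocation patterns the nit is comparing (a sketch; group_size and current_device are the variables from the snippet above):

```python
# Allocates on CPU first, then copies to the target device:
object_sizes_tensor = torch.zeros(group_size, dtype=int).to(current_device)

# Allocates directly on the target device, avoiding the extra copy:
object_sizes_tensor = torch.zeros(group_size, dtype=int, device=current_device)
```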

Member Author

yes, good catch. Will fix all these callsites.

@@ -1400,7 +1412,7 @@ def all_gather_object(object_list, obj, group=group.WORLD):
input_tensor.resize_(max_object_size)
coalesced_output_tensor = torch.empty(
max_object_size * group_size, dtype=torch.uint8
- ).to(my_rank if is_nccl_backend else "cpu")
+ ).to(torch.cuda.current_device() if is_nccl_backend else "cpu")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: use a var to dedup this if-else and also merge it with the if is_nccl_backend clause? E.g.:

current_device = torch.device("cpu")
if is_nccl_backend:
    current_device = torch.cuda.current_device()
    input_tensor = input_tensor.to(current_device)
    ...
object_sizes_tensor = torch.zeros(group_size, dtype=int, device=current_device)
...
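
Filled out, that suggestion might look like the sketch below (the helper name _gather_prep is hypothetical; input_tensor, local_size, group_size, and is_nccl_backend are the variables from the snippets above, and torch.long corresponds to the dtype=int in the diff):

```python
import torch


def _gather_prep(input_tensor, local_size, group_size, is_nccl_backend):
    # Hypothetical helper illustrating the suggestion above; in the PR this
    # logic lives inline in the object-based collectives.
    current_device = torch.device("cpu")
    if is_nccl_backend:
        # NCCL operates on CUDA tensors placed on the device previously chosen
        # via torch.cuda.set_device(), not on the device whose index equals the rank.
        current_device = torch.device("cuda", torch.cuda.current_device())
        input_tensor = input_tensor.to(current_device)
        local_size = local_size.to(current_device)
    # Create the sizes tensor directly on the target device to avoid an extra copy.
    object_sizes_tensor = torch.zeros(group_size, dtype=torch.long, device=current_device)
    return current_device, input_tensor, local_size, object_sizes_tensor
```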

@@ -1475,7 +1491,7 @@ def gather_object(obj, object_gather_list=None, dst=0, group=group.WORLD):
if my_rank == dst:
coalesced_output_tensor = torch.empty(
max_object_size * group_size, dtype=torch.uint8
- ).to(my_rank if is_nccl_backend else "cpu")
+ ).to(torch.cuda.current_device() if is_nccl_backend else "cpu")
Contributor

ditto

@codecov

codecov bot commented Oct 27, 2020

Codecov Report

Merging #46897 into gh/rohan-varma/190/base will decrease coverage by 0.08%.
The diff coverage is 0.00%.

@@                     Coverage Diff                     @@
##           gh/rohan-varma/190/base   #46897      +/-   ##
===========================================================
- Coverage                    68.96%   68.88%   -0.09%     
===========================================================
  Files                          434      434              
  Lines                        56219    56128      -91     
===========================================================
- Hits                         38771    38661     -110     
- Misses                       17448    17467      +19     

Update on "Fix object-based collectives API to use torch.cuda.current_device instead of rank"

[ghstack-poisoned]
rohan-varma added a commit that referenced this pull request Oct 28, 2020
Fix object-based collectives API to use torch.cuda.current_device instead of rank

Pull Request resolved: #46897
ghstack-source-id: 115359633

Differential Revision: [D24556177](https://our.internmc.facebook.com/intern/diff/D24556177/)
@rohan-varma
Member Author

CI errors are unrelated, landing

@facebook-github-bot
Contributor

This pull request has been merged in c7183c9.


Labels
Merged, oncall: distributed
3 participants