
Fix object-based collectives API to use torch.cuda.current_device instead of rank #46897

Closed
rohan-varma wants to merge 4 commits into gh/rohan-varma/190/base

Conversation

rohan-varma
Member

@rohan-varma rohan-varma commented Oct 27, 2020

Stack from ghstack:

These APIs implicitly assumed that the GPU for a rank is the same as the rank index, but
that is not necessarily true. For example, the first GPU could be reserved for a
different purpose, so rank 0 could use GPU 1, rank 1 could use GPU 2, and so on. Thus, we
require the user to specify the device to use via torch.cuda.set_device()
before making calls to these APIs. This expectation is reasonable since we
document it clearly, and we already expect the user to set the device for
DistributedDataParallel. Backwards compatibility is not a concern since these APIs have not been publicly announced yet.

Also adds/tidies up some documentation.

Differential Revision: [D24556177](https://our.internmc.facebook.com/intern/diff/D24556177/)
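
For context, a minimal usage sketch of the documented expectation (a hypothetical script, not part of this PR; it assumes the NCCL backend and a launcher that sets RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT, and LOCAL_RANK):

```python
import os

import torch
import torch.distributed as dist


def main():
    # Assumes a launcher has exported RANK, WORLD_SIZE, MASTER_ADDR,
    # MASTER_PORT, and LOCAL_RANK for this process.
    local_rank = int(os.environ["LOCAL_RANK"])
    dist.init_process_group(backend="nccl")

    # The expectation documented by this PR: pick this rank's GPU explicitly
    # instead of relying on "GPU index == rank".
    torch.cuda.set_device(local_rank)

    gathered = [None] * dist.get_world_size()
    dist.all_gather_object(gathered, {"rank": dist.get_rank()})
    print(gathered)


if __name__ == "__main__":
    main()
```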

Fix object-based collectives API to use torch.cuda.current_device instead of rank

[ghstack-poisoned]
@facebook-github-bot facebook-github-bot added the oncall: distributed label Oct 27, 2020
Update on "Fix object-based collectives API to use torch.cuda.current_device instead of rank"

[ghstack-poisoned]
rohan-varma added a commit that referenced this pull request Oct 27, 2020
Fix object-based collectives API to use torch.cuda.current_device instead of rank

Pull Request resolved: #46897
ghstack-source-id: 115223259

Differential Revision: [D24556177](https://our.internmc.facebook.com/intern/diff/D24556177/)
@dr-ci

dr-ci bot commented Oct 27, 2020

💊 CI failures summary and remediations

As of commit b062f49 (more details on the Dr. CI page):


  • 3/3 failures possibly* introduced in this PR
    • 1/3 non-CircleCI failure(s)

🕵️ 1 new failure recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_linux_backward_compatibility_check_test (1/1)

Step: "Run tests"

Oct 28 15:41:11 The PR is introducing backward incompatible changes to the operator library. Please contact PyTorch team to confirm whether this change is wanted or not.
Oct 28 15:41:11 processing existing schema:  __setstate__(__torch__.torch.classes.quantized.EmbeddingPackedParamsBase _0, (int, Tensor[], float[], int[]) _1) -> (None _0) 
Oct 28 15:41:11 processing existing schema:  bit_rate(__torch__.torch.classes.quantized.EmbeddingPackedParamsBase _0) -> (int _0) 
Oct 28 15:41:11 processing existing schema:  version(__torch__.torch.classes.quantized.EmbeddingPackedParamsBase _0) -> (int _0) 
Oct 28 15:41:11 processing existing schema:  __getstate__(__torch__.torch.classes.xnnpack.LinearOpContext _0) -> ((Tensor, Tensor?, Scalar?, Scalar?) _0) 
Oct 28 15:41:11 processing existing schema:  __setstate__(__torch__.torch.classes.xnnpack.LinearOpContext _0, (Tensor, Tensor?, Scalar?, Scalar?) _1) -> (None _0) 
Oct 28 15:41:11 processing existing schema:  __getstate__(__torch__.torch.classes.xnnpack.Conv2dOpContext _0) -> ((Tensor, Tensor?, int[], int[], int[], int, Scalar?, Scalar?) _0) 
Oct 28 15:41:11 processing existing schema:  __setstate__(__torch__.torch.classes.xnnpack.Conv2dOpContext _0, (Tensor, Tensor?, int[], int[], int[], int, Scalar?, Scalar?) _1) -> (None _0) 
Oct 28 15:41:11 processing existing schema:  __getstate__(__torch__.torch.classes.xnnpack.TransposeConv2dOpContext _0) -> ((Tensor, Tensor?, int[], int[], int[], int[], int, Scalar?, Scalar?) _0) 
Oct 28 15:41:11 processing existing schema:  __setstate__(__torch__.torch.classes.xnnpack.TransposeConv2dOpContext _0, (Tensor, Tensor?, int[], int[], int[], int[], int, Scalar?, Scalar?) _1) -> (None _0) 
Oct 28 15:41:11 processing existing schema:  __init__(__torch__.torch.classes.dist_rpc.WorkerInfo _0, str _1, int _2) -> (None _0) 
Oct 28 15:41:11 The PR is introducing backward incompatible changes to the operator library. Please contact PyTorch team to confirm whether this change is wanted or not.  
Oct 28 15:41:11  
Oct 28 15:41:11 Broken ops: [ 
Oct 28 15:41:11 	aten::_foreach_zero_(Tensor[] self) -> () 
Oct 28 15:41:11 ] 
Oct 28 15:41:11 + cleanup 
Oct 28 15:41:11 + retcode=1 
Oct 28 15:41:11 + set +x 
Oct 28 15:41:11 =================== sccache compilation log =================== 
Oct 28 15:41:11 =========== If your build fails, please take a look at the log above for possible reasons =========== 
Oct 28 15:41:11 Compile requests                      0 

1 failure not recognized by patterns:

Job: CircleCI pytorch_linux_xenial_cuda10_2_cudnn7_py3_gcc7_test1
Step: Report results

ci.pytorch.org: 1 failed



Update on "Fix object-based collectives API to use torch.cuda.current_device instead of rank"

[ghstack-poisoned]
rohan-varma added a commit that referenced this pull request Oct 27, 2020
Fix object-based collectives API to use torch.cuda.current_device instead of rank

Pull Request resolved: #46897
ghstack-source-id: 115266274

Differential Revision: [D24556177](https://our.internmc.facebook.com/intern/diff/D24556177/)
Contributor

@mrshenli mrshenli left a comment


LGTM! Added some minor comments.

# true.
current_device = torch.cuda.current_device()
input_tensor = input_tensor.to(current_device)
local_size = local_size.to(current_device)
# Gather all local sizes. This is so that we can find the max size, and index
# until the correct size when deserializing the tensors.
group_size = get_world_size(group=group)
object_sizes_tensor = torch.zeros(group_size, dtype=int).to(
Contributor

nit: we can avoid a copy by directly creating the tensor on the desired device?
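
For illustration, the two allocation patterns the nit is comparing (a sketch; group_size and current_device are the variables from the snippet above):

```python
# Allocates on CPU first, then copies to the target device:
object_sizes_tensor = torch.zeros(group_size, dtype=int).to(current_device)

# Allocates directly on the target device, avoiding the extra copy:
object_sizes_tensor = torch.zeros(group_size, dtype=int, device=current_device)
```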

Member Author

yes, good catch. Will fix all these callsites.

@@ -1400,7 +1412,7 @@ def all_gather_object(object_list, obj, group=group.WORLD):
input_tensor.resize_(max_object_size)
coalesced_output_tensor = torch.empty(
max_object_size * group_size, dtype=torch.uint8
- ).to(my_rank if is_nccl_backend else "cpu")
+ ).to(torch.cuda.current_device() if is_nccl_backend else "cpu")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: use a var to dedup this if-else and also merge it with the if is_nccl_backend clause? E.g.:

current_device = torch.device("cpu")
if is_nccl_backend:
    current_device = torch.cuda.current_device()
    input_tensor = input_tensor.to(current_device)
    ...
object_sizes_tensor = torch.zeros(group_size, dtype=int, device=current_device)
...
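
Filled out, that suggestion might look like the sketch below (the helper name _gather_prep is hypothetical; input_tensor, local_size, group_size, and is_nccl_backend are the variables from the snippets above, and torch.long corresponds to the dtype=int in the diff):

```python
import torch


def _gather_prep(input_tensor, local_size, group_size, is_nccl_backend):
    # Hypothetical helper illustrating the suggestion above; in the PR this
    # logic lives inline in the object-based collectives.
    current_device = torch.device("cpu")
    if is_nccl_backend:
        # NCCL operates on CUDA tensors placed on the device previously chosen
        # via torch.cuda.set_device(), not on the device whose index equals the rank.
        current_device = torch.device("cuda", torch.cuda.current_device())
        input_tensor = input_tensor.to(current_device)
        local_size = local_size.to(current_device)
    # Create the sizes tensor directly on the target device to avoid an extra copy.
    object_sizes_tensor = torch.zeros(group_size, dtype=torch.long, device=current_device)
    return current_device, input_tensor, local_size, object_sizes_tensor
```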

@@ -1475,7 +1491,7 @@ def gather_object(obj, object_gather_list=None, dst=0, group=group.WORLD):
if my_rank == dst:
coalesced_output_tensor = torch.empty(
max_object_size * group_size, dtype=torch.uint8
- ).to(my_rank if is_nccl_backend else "cpu")
+ ).to(torch.cuda.current_device() if is_nccl_backend else "cpu")
Contributor

ditto

@codecov

codecov bot commented Oct 27, 2020

Codecov Report

Merging #46897 into gh/rohan-varma/190/base will decrease coverage by 0.08%.
The diff coverage is 0.00%.

@@                     Coverage Diff                     @@
##           gh/rohan-varma/190/base   #46897      +/-   ##
===========================================================
- Coverage                    68.96%   68.88%   -0.09%     
===========================================================
  Files                          434      434              
  Lines                        56219    56128      -91     
===========================================================
- Hits                         38771    38661     -110     
- Misses                       17448    17467      +19     

Update on "Fix object-based collectives API to use torch.cuda.current_device instead of rank"

[ghstack-poisoned]
rohan-varma added a commit that referenced this pull request Oct 28, 2020
Fix object-based collectives API to use torch.cuda.current_device instead of rank

Pull Request resolved: #46897
ghstack-source-id: 115359633

Differential Revision: [D24556177](https://our.internmc.facebook.com/intern/diff/D24556177/)
@rohan-varma
Member Author

CI errors are unrelated, landing

@facebook-github-bot
Contributor

This pull request has been merged in c7183c9.


Labels
Merged, oncall: distributed
3 participants