Fix object-based collectives API to use torch.cuda.current_device instead of rank #46897
Conversation
These APIs implicitly assumed that the GPU for a rank == the rank index, but that is not necessarily true. For example, the first GPU could be used for a different purpose, so rank 0 could use GPU 1, rank 1 could use GPU 2, and so on. Thus, we mandate that the user specify the device to use via `torch.cuda.set_device()` before making calls to this API. This expectation should be okay since we clearly document it, and we expect the user to set the device for DistributedDataParallel as well. Also adds/tidies up some documentation.

Differential Revision: [D24556177](https://our.internmc.facebook.com/intern/diff/D24556177/)
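For illustration, a minimal sketch of the usage the new contract expects. The rank-to-GPU mapping (GPU 0 reserved, rank `r` uses GPU `r + 1`) and the `run` helper are hypothetical, and it assumes `MASTER_ADDR`/`MASTER_PORT` are set in the environment:

```python
import torch
import torch.distributed as dist

def run(rank, world_size):
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    # Hypothetical setup: GPU 0 is reserved for another purpose, so rank r
    # uses GPU r + 1. The collective no longer assumes device index == rank;
    # it uses whatever torch.cuda.set_device() selected.
    torch.cuda.set_device(rank + 1)
    gathered = [None] * world_size
    dist.all_gather_object(gathered, {"rank": rank})
    # gathered == [{"rank": 0}, {"rank": 1}, ..., {"rank": world_size - 1}]
```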
💊 CI failures summary and remediations

As of commit b062f49 (more details on the Dr. CI page):

🕵️ 1 new failure recognized by patterns

The following CI failures do not appear to be due to upstream breakages: pytorch_linux_backward_compatibility_check_test (1/1), Step: "Run tests" (full log | diagnosis details | 🔁 rerun)
| Job | Step | Action |
| --- | --- | --- |
| pytorch_linux_xenial_cuda10_2_cudnn7_py3_gcc7_test1 | Report results | 🔁 rerun |

ci.pytorch.org: 1 failed
This comment was automatically generated by Dr. CI.
LGTM! Added some minor comments.
```python
# true.
current_device = torch.cuda.current_device()
input_tensor = input_tensor.to(current_device)
local_size = local_size.to(current_device)
# Gather all local sizes. This is so that we can find the max size, and index
# until the correct size when deserializing the tensors.
group_size = get_world_size(group=group)
object_sizes_tensor = torch.zeros(group_size, dtype=int).to(
```
nit: we can avoid a copy by directly creating the tensor on the desired device?
yes, good catch. Will fix all these callsites.
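A sketch of the fixed allocation under the reviewer's suggestion (the names `group_size` and `current_device` are taken from the diff above; this is not necessarily the final merged code):

```python
# Allocate directly on the target device instead of allocating on the
# default device and then paying for a copy with .to().
object_sizes_tensor = torch.zeros(group_size, dtype=int, device=current_device)
```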
```diff
@@ -1400,7 +1412,7 @@ def all_gather_object(object_list, obj, group=group.WORLD):
     input_tensor.resize_(max_object_size)
     coalesced_output_tensor = torch.empty(
         max_object_size * group_size, dtype=torch.uint8
-    ).to(my_rank if is_nccl_backend else "cpu")
+    ).to(torch.cuda.current_device() if is_nccl_backend else "cpu")
```
nit: use a var to dedup this if-else, and also merge it with the `if is_nccl_backend` clause? E.g.:

```python
current_device = torch.device("cpu")
if is_nccl_backend:
    current_device = torch.cuda.current_device()
input_tensor = input_tensor.to(current_device)
...
object_sizes_tensor = torch.zeros(group_size, dtype=int, device=current_device)
...
```
```diff
@@ -1475,7 +1491,7 @@ def gather_object(obj, object_gather_list=None, dst=0, group=group.WORLD):
     if my_rank == dst:
         coalesced_output_tensor = torch.empty(
             max_object_size * group_size, dtype=torch.uint8
-        ).to(my_rank if is_nccl_backend else "cpu")
+        ).to(torch.cuda.current_device() if is_nccl_backend else "cpu")
```
ditto
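For illustration, a sketch of the same dedup applied to this `gather_object` callsite, computing a single `current_device` once and reusing it for every allocation (a sketch following the reviewer's suggestion, not the merged code):

```python
# Pick the communication device once: the current CUDA device for NCCL,
# otherwise CPU. Every later allocation reuses it.
current_device = torch.device("cpu")
if is_nccl_backend:
    current_device = torch.device("cuda", torch.cuda.current_device())
if my_rank == dst:
    coalesced_output_tensor = torch.empty(
        max_object_size * group_size, dtype=torch.uint8, device=current_device
    )
```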
Codecov Report

```
@@                Coverage Diff                 @@
##    gh/rohan-varma/190/base   #46897      +/-   ##
====================================================
- Coverage                 68.96%   68.88%   -0.09%
====================================================
  Files                       434      434
  Lines                     56219    56128      -91
====================================================
- Hits                      38771    38661     -110
- Misses                    17448    17467      +19
```
CI errors are unrelated, landing.
This pull request has been merged in c7183c9.
Stack from ghstack:
These APIs implicitly assumed that the GPU for a rank == the rank index, but that is not necessarily true. For example, the first GPU could be used for a different purpose, so rank 0 could use GPU 1, rank 1 could use GPU 2, and so on. Thus, we mandate that the user specify the device to use via `torch.cuda.set_device()` before making calls to this API. This expectation should be okay since we clearly document it, and we expect the user to set the device for DistributedDataParallel as well. Backwards compatibility is not an issue since these APIs have not been publicly announced yet.

Also adds/tidies up some documentation.

Differential Revision: D24556177