[C10D] Add tests for gather and gather_object with subgroup #118359
Conversation
Addresses #118337 somewhat; we probably need to update docs. Let's first confirm what behavior we want.

Identifies a couple of confusing things:
1) The 'dst' arg for many collectives is always a 'global' rank regardless of whether a subgroup is passed in. This needs a doc update.
2) gather_object has a strong dependency on setting the CUDA device; could we make that smoother?
3) gather_object should also be happy with an empty list on the dst side, imo.

[ghstack-poisoned]
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/118359
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 9a9057c with merge base d6b556b.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Fixes #118337 by updating docs. Let's discuss further whether we want any behavior changes. Identifies a couple of confusing things: 1) the 'dst' arg for many collectives is always a 'global' rank regardless of whether a subgroup is passed in; 2) gather_object has a strong dependency on setting the CUDA device; could we make that smoother? 3) gather_object should also be happy with an empty list on the dst side, imo.

ghstack-source-id: fb74977
Pull Request resolved: #118359
@kwen2501 Wonder if you know off the top of your head whether the same doc change I made for the gather variants applies to the src rank for the scatter variants, and any other ops (send/recv, perhaps)? We need to update them all, but I ran out of steam tonight.
    collective and will contain the output. Must be ``None`` on non-dst
    ranks. (default is ``None``)
-   dst (int, optional): Destination rank. (default is 0)
+   dst (int, optional): Destination rank on global process group (regardless of 'group' argument). (default is 0)
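To make the documented behavior concrete, here is a hedged sketch (not part of this diff) of what "destination rank on global process group" means in practice; the subgroup membership and tensor shapes are illustrative.

```python
import torch
import torch.distributed as dist

# Assumes a 4-rank NCCL job that is already initialized, with one GPU per rank.
rank = dist.get_rank()
torch.cuda.set_device(rank)
device = torch.device("cuda", rank)

# new_group() is collective: every rank must call it, even ranks not in the subgroup.
group = dist.new_group([2, 3])

if rank in (2, 3):
    tensor = torch.full((2,), float(rank), device=device)
    # dst=3 is the receiver's *global* rank, not its rank (1) inside the subgroup.
    gather_list = [torch.empty(2, device=device) for _ in range(2)] if rank == 3 else None
    dist.gather(tensor, gather_list=gather_list, dst=3, group=group)
```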
Is this expected to be the case for all our APIs, where global ranks should always be used? Or should we change the behavior?
I prefer the changed behavior, but I do not know if it is worth the disruption of making a breaking change. Worth a discussion.
If we want to change it, what we might have to do is first add a 'group_rank' arg and add logic to ensure only one of group_rank or rank is passed, then deprecate rank, then finally remove rank. Or how else?
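A minimal sketch of the migration path described above, assuming a hypothetical ``group_dst`` argument and helper name (neither exists in torch.distributed today); shown only to illustrate the mutual-exclusion step of the deprecation.

```python
import torch.distributed as dist

def _resolve_dst(dst=None, group_dst=None, group=None):
    # During the deprecation window, require exactly one of the two arguments.
    if (dst is None) == (group_dst is None):
        raise ValueError("pass exactly one of 'dst' (global rank) or 'group_dst' (group rank)")
    if group_dst is not None:
        group = group if group is not None else dist.group.WORLD
        # Translate the group-relative rank into the global rank used internally.
        return dist.get_global_rank(group, group_dst)
    return dst
```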
Approving, since this is more about updating docs and verifying current behavior.
Will update docs / unit tests similarly for send/reduce with the dst param.
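For reference, a hedged sketch of what the same global-rank semantics look like for the point-to-point ops mentioned here, assuming a 4-rank NCCL job with one GPU per rank and that ``send``/``recv`` follow the same convention (which the follow-up PR #118593 later documents).

```python
import torch
import torch.distributed as dist

# Assumes a 4-rank NCCL job that is already initialized, with one GPU per rank.
rank = dist.get_rank()
torch.cuda.set_device(rank)
group = dist.new_group([2, 3])  # collective: called by every rank

if rank == 2:
    # dst=3 is the receiver's global rank, even though the send runs on the subgroup.
    dist.send(torch.ones(1, device="cuda"), dst=3, group=group)
elif rank == 3:
    buf = torch.empty(1, device="cuda")
    # src=2 is likewise the sender's global rank.
    dist.recv(buf, src=2, group=group)
```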
test/distributed/test_c10d_nccl.py
Outdated
            return

        store = c10d.FileStore(self.file_name, self.world_size)
        torch.distributed.init_process_group(backend="nccl", store=store, rank=self.rank, world_size=self.world_size)
On devgpu with 8 GPUs (world_size=8), it seems only ranks 0...3 call torch.distributed.init_process_group? Does the PG not complain about the missing init from ranks 4...7?
Good point. Since I chose to make my test run with exactly 4 GPUs for simplicity, I should not rely on self.world_size to report 4 in case someone changes it. I will update this to pass 4 into init_process_group.
test/distributed/test_c10d_nccl.py
Outdated
    @requires_nccl()
    @skip_if_lt_x_gpu(4)
    def test_gather_subgroup(self):
        if self.rank > 3:
Do we want to set world_size explicitly? Right now it's hardcoded to 4 from test_c10d_common.AbstractLargeCommTest anyway:

    @property
    def world_size(self):
        return 4
I wrote the test in as simple a way as possible, hardcoding lists of ranks, e.g. [0, 1], instead of doing things like ranks = list(range(self.world_size))[:self.world_size // 2]. It just felt easier to read and is even fewer LOC this way. It is not that important to have the test run on 8 GPUs if they are available.
Thanks, it's clear now with world_size=4 in init_process_group.
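Putting the thread together, a hedged sketch of the test shape being discussed. It assumes the MultiProcessTestCase-style ``self.rank`` / ``self.file_name`` attributes and the decorators used in test_c10d_nccl.py; the merged test body may differ.

```python
import torch
import torch.distributed as c10d

@requires_nccl()
@skip_if_lt_x_gpu(4)
def test_gather_subgroup(self):
    if self.rank > 3:
        return  # only 4 ranks participate, even if more GPUs are available

    store = c10d.FileStore(self.file_name, 4)
    # world_size is passed explicitly as 4 instead of relying on self.world_size.
    c10d.init_process_group(backend="nccl", store=store, rank=self.rank, world_size=4)
    torch.cuda.set_device(self.rank)  # gather_object depends on the CUDA device being set
    device = torch.device("cuda", self.rank)

    # Hardcoded subgroups for readability; every rank must create both groups.
    group01 = c10d.new_group([0, 1])
    group23 = c10d.new_group([2, 3])
    my_group = group01 if self.rank in (0, 1) else group23

    # dst is the destination's *global* rank, even when a subgroup is passed.
    dst = 0 if self.rank in (0, 1) else 2
    tensor = torch.full((2,), float(self.rank), device=device)
    gather_list = [torch.empty(2, device=device) for _ in range(2)] if self.rank == dst else None
    c10d.gather(tensor, gather_list=gather_list, dst=dst, group=my_group)
    c10d.destroy_process_group()
```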
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Follow up #118359: whether ``src`` and ``dst`` are based on the global pg or the sub pg
* update c10d docstrings: ``src`` / ``dst`` are based on the global pg regardless of the ``group`` argument
* communication ops with a ``dst`` argument: ``reduce``, ``gather_object``, ``gather``, ``send``, ``isend``
* communication ops with a ``src`` argument: ``irecv``, ``recv``, ``broadcast``, ``broadcast_object_list``, ``scatter``, ``scatter_object_list``
* ``pytest test/distributed/test_c10d_nccl.py -k subgroup``

3 collectives are for picklable objects (``gather_object``, ``broadcast_object_list``, ``scatter_object_list``). There are 2 ways to set the device:
* use a device argument: it's implemented in ``broadcast_object_list``; maybe worth implementing in the other 2
* ``torch.cuda.set_device(global_rank)``

Pull Request resolved: #118593
Approved by: https://github.com/wconstab
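To illustrate the two device-setup options listed above for the object collectives, a hedged sketch assuming an already-initialized NCCL job with one GPU per rank; the payloads are illustrative.

```python
import torch
import torch.distributed as dist

rank = dist.get_rank()

# Option 1: pass a device argument explicitly (currently supported by broadcast_object_list).
objs = ["hello", 42] if rank == 0 else [None, None]
dist.broadcast_object_list(objs, src=0, device=torch.device("cuda", rank))

# Option 2: set the current CUDA device so the collective picks it up implicitly,
# which is what gather_object relies on today.
torch.cuda.set_device(rank)
gathered = [None] * dist.get_world_size() if rank == 0 else None
dist.gather_object(f"from rank {rank}", gathered, dst=0)
```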
…118359)

Addresses pytorch#118337 somewhat; we probably need to update docs. Let's first confirm what behavior we want. Identifies a couple of confusing things: 1) the 'dst' arg for many collectives is always a 'global' rank regardless of whether a subgroup is passed in (this needs a doc update); 2) gather_object has a strong dependency on setting the CUDA device; could we make that smoother? 3) gather_object should also be happy with an empty list on the dst side, imo.

Pull Request resolved: pytorch#118359
Approved by: https://github.com/weifengpy
Stack from ghstack (oldest at bottom):
Addresses #118337 somewhat; we probably need to update docs. Let's first confirm what behavior we want.

Identifies a couple of confusing things:
1) The 'dst' arg for many collectives is always a 'global' rank regardless of whether a subgroup is passed in. This needs a doc update.
2) gather_object has a strong dependency on setting the CUDA device; could we make that smoother?
3) gather_object should also be happy with an empty list on the dst side, imo.
cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @yf225