
Conversation

wconstab
Contributor

@wconstab wconstab commented Jan 26, 2024

Stack from ghstack (oldest at bottom):

Addresses #118337 somewhat; we probably need to update docs. Let's first
confirm what behavior we want.

Identifies a few confusing things:

  1. The 'dst' arg for many collectives is always a 'global' rank, regardless
     of whether a subgroup is passed in. This needs a doc update (see the
     sketch after this list).
  2. gather_object has a strong dependency on setting the cuda device;
     could we make that smoother?
  3. gather_object should also be happy with an empty list on the dst
     side, imo.
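
As an illustration of point 1, here is a minimal sketch (not taken from this PR's tests; the ranks, tensor shapes, and helper name are illustrative), assuming a 4-rank NCCL setup where each process has already selected its CUDA device:

import torch
import torch.distributed as dist

def gather_into_subgroup():
    # Assumes init_process_group was already called with world_size == 4.
    subgroup = dist.new_group(ranks=[2, 3])  # global ranks 2 and 3
    if dist.get_rank() not in (2, 3):
        return None
    tensor = torch.full((2,), float(dist.get_rank()), device="cuda")
    # The receiver is global rank 3 (group rank 1). Under the current behavior,
    # dst=3 (the global rank) is what works; dst=1 (the group rank) would not.
    gather_list = (
        [torch.empty(2, device="cuda") for _ in range(2)]
        if dist.get_rank() == 3
        else None
    )
    dist.gather(tensor, gather_list=gather_list, dst=3, group=subgroup)
    return gather_list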

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @yf225

@pytorch-bot pytorch-bot bot added the "topic: not user facing" label Jan 26, 2024

pytorch-bot bot commented Jan 26, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/118359

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 9a9057c with merge base d6b556b:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@github-actions github-actions bot added the "oncall: distributed" label Jan 26, 2024
wconstab added a commit that referenced this pull request Jan 26, 2024
Fixes #118337 by updating docs.
Let's discuss further whether we want any behavior changes.

Identifies a few confusing things:
1) The 'dst' arg for many collectives is always a 'global' rank, regardless
   of whether a subgroup is passed in.
2) gather_object has a strong dependency on setting the cuda device;
   could we make that smoother?
3) gather_object should also be happy with an empty list on the dst
   side, imo.

ghstack-source-id: fb74977
Pull Request resolved: #118359
@wconstab wconstab requested review from kwen2501 and wanchaol January 26, 2024 04:21
@wconstab
Contributor Author

@kwen2501 wonder if you know off the top of your head whether the same doc change I made for the gather variants applies to the src rank for the scatter variants, and any other ops (send/recv, perhaps)? We need to update them all, but I ran out of steam tonight.

     collective and will contain the output. Must be ``None`` on non-dst
     ranks. (default is ``None``)
-    dst (int, optional): Destination rank. (default is 0)
+    dst (int, optional): Destination rank on global process group (regardless of 'group' argument). (default is 0)
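
(Editorial aside, not part of the diff: given this documented behavior, a caller that thinks in group-local ranks can translate with ``dist.get_global_rank`` before calling the collective. A minimal sketch; the helper name is illustrative.)

import torch.distributed as dist

def gather_to_group_rank(tensor, gather_list, group, group_dst):
    # ``dst`` expects a global rank, so convert the rank-within-group first.
    global_dst = dist.get_global_rank(group, group_dst)
    dist.gather(tensor, gather_list=gather_list, dst=global_dst, group=group)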
Contributor

@fegin fegin Jan 26, 2024

Is this expected to be the case for all our APIs, where global ranks should always be used? Or should we change the behavior?

Contributor Author

I prefer the changed behavior, but I do not know if it is worth the disruption of making a breaking change. Worth a discussion.

Contributor Author

If we want to change it, what we might have to do is first add a 'group_rank' arg and add logic to ensure only one of group_rank or rank is passed, then deprecate rank, and finally remove rank. Or how else?
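
A hypothetical sketch of that migration path; the ``group_dst`` name, the warning text, and the helper itself are illustrative, not an existing API:

import warnings
import torch.distributed as dist

def _resolve_dst(dst=None, group_dst=None, group=None):
    # Accept exactly one of the two; translate group_dst into the global rank
    # that the existing collectives understand.
    if dst is not None and group_dst is not None:
        raise ValueError("pass only one of 'dst' (global rank) or 'group_dst' (group rank)")
    if group_dst is not None:
        return dist.get_global_rank(group or dist.group.WORLD, group_dst)
    if dst is None:
        return 0  # current default
    warnings.warn("'dst' (global rank) may later be deprecated in favor of 'group_dst'", FutureWarning)
    return dst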

Contributor

@weifengpy weifengpy left a comment

Approving since this is more about updating docs and verifying current behavior.
Will update docs / unit tests similarly for send/reduce with the dst param.

return

store = c10d.FileStore(self.file_name, self.world_size)
torch.distributed.init_process_group(backend="nccl", store=store, rank=self.rank, world_size=self.world_size)
Contributor

On devgpu with 8 GPUs (world_size=8), it seems only ranks 0...3 call torch.distributed.init_process_group? PG does not complain about no init from ranks 4...7?

Contributor Author

Good point. Since I chose to make my test run with exactly 4 GPUs for simplicity, I should not rely on self.world_size to report 4 in case someone changes it. I will update this to pass 4 into init_process_group.
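
A sketch of that adjustment as a self-contained helper (the helper name is illustrative and the actual test code may differ):

import torch.distributed as dist

def init_four_rank_pg(file_name, rank):
    # Ranks beyond the first 4 simply skip; the test only exercises 4 GPUs.
    if rank > 3:
        return False
    store = dist.FileStore(file_name, 4)  # world size pinned to 4 explicitly
    dist.init_process_group(backend="nccl", store=store, rank=rank, world_size=4)
    return True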

@requires_nccl()
@skip_if_lt_x_gpu(4)
def test_gather_subgroup(self):
    if self.rank > 3:
Contributor

@weifengpy weifengpy Jan 26, 2024

Do we want to set world_size explicitly? Right now it's hard-coded to 4 from test_c10d_common.AbstractLargeCommTest anyway:

@property
def world_size(self):
    return 4

Contributor Author

I wrote the test in as simple a way as possible, hardcoding lists of ranks, e.g. [0, 1], instead of doing things like ranks = list(range(self.world_size))[:self.world_size // 2]. It just felt easier to read and even fewer LOC this way. It is not that important to have the test run on 8 GPUs if they are available.

Contributor

Thanks. It's clear now with world_size=4 in init_process_group.

wconstab added a commit that referenced this pull request Jan 27, 2024
ghstack-source-id: 3869bec
Pull Request resolved: #118359
@wconstab
Contributor Author

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the "ciflow/trunk" label Jan 27, 2024
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

@facebook-github-bot facebook-github-bot deleted the gh/wconstab/266/head branch January 30, 2024 15:23
pytorchmergebot pushed a commit that referenced this pull request Jan 30, 2024
Follow up to #118359: whether ``src`` and ``dst`` are based on the global pg or the sub pg
* update c10d docstrings: ``src`` / ``dst`` are based on the global pg regardless of the ``group`` argument
* communication ops with a ``dst`` argument: ``reduce``, ``gather_object``, ``gather``, ``send``, ``isend``
* communication ops with a ``src`` argument: ``irecv``, ``recv``, ``broadcast``, ``broadcast_object_list``, ``scatter``, ``scatter_object_list``
* ``pytest test/distributed/test_c10d_nccl.py -k subgroup``

3 collectives are for picklable objects (``gather_object``, ``broadcast_object_list``, ``scatter_object_list``). There are 2 ways to set the device (see the sketch below):
* use a device argument: it's implemented in ``broadcast_object_list``; maybe worth implementing in the other 2
* ``torch.cuda.set_device(global_rank)``

Pull Request resolved: #118593
Approved by: https://github.com/wconstab
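
A minimal sketch (not part of the commit message above) of the two device-selection options, assuming one GPU per process and an already-initialized NCCL group where each global rank maps to the CUDA device with the same index:

import torch
import torch.distributed as dist

def demo_object_collectives(payload, rank, world_size):
    # Option 1: pass the device explicitly; broadcast_object_list accepts a
    # ``device`` argument for staging the serialized objects.
    obj_list = [payload if rank == 0 else None]
    dist.broadcast_object_list(obj_list, src=0, device=torch.device("cuda", rank))

    # Option 2: set the current CUDA device up front; gather_object then picks
    # up the device implicitly (the dependency the PR description calls out).
    torch.cuda.set_device(rank)
    gathered = [None] * world_size if rank == 0 else None
    dist.gather_object(obj_list[0], gathered, dst=0)
    return gathered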
jeffdaily pushed a commit to ROCm/pytorch that referenced this pull request Feb 8, 2024
…118359)

Pull Request resolved: pytorch#118359
Approved by: https://github.com/weifengpy