Skip to content

Conversation

wz337
Copy link
Contributor

@wz337 wz337 commented Dec 13, 2023

Currently, we create new_group for sub_group pg during mesh initialization. The PR changes this so we will:

  1. re-use sub_group pg if it exsits,
  2. create new sub_group pg if it does not exist.

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @tianyu-l @wconstab @yf225

Copy link

pytorch-bot bot commented Dec 13, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/115716

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 7f01d37 with merge base c0732c8 (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@github-actions github-actions bot added the oncall: distributed Add this issue/PR to distributed oncall triage queue label Dec 13, 2023
@wz337 wz337 force-pushed the reuse_pg_device_mesh branch 2 times, most recently from b61529c to e3f2faa Compare December 13, 2023 01:02
@wz337 wz337 force-pushed the reuse_pg_device_mesh branch 2 times, most recently from 389211b to 5c6c1ca Compare December 13, 2023 01:05
@wz337 wz337 marked this pull request as ready for review December 13, 2023 01:08
@wz337 wz337 requested review from awgu and wanchaol December 13, 2023 01:10
@wz337 wz337 force-pushed the reuse_pg_device_mesh branch from 5c6c1ca to ef4cb16 Compare December 13, 2023 01:13
@wz337 wz337 added module: DeviceMesh ciflow/trunk Trigger trunk jobs on your pull request ciflow/periodic Trigger jobs ran periodically on master (periodic.yml) on the PR and removed release notes: DeviceMesh labels Dec 13, 2023
@wz337 wz337 force-pushed the reuse_pg_device_mesh branch 2 times, most recently from dd3053d to a097281 Compare December 13, 2023 03:14
Copy link
Collaborator

@wanchaol wanchaol left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! looks great! one comment about testing

# 2) tag for world size pg
# 3) tag for tp pg
# 4) tag for dp pg
self.assertEqual(len(tags_to_pg), 4)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we also check before/after init_device_mesh, tags_to_pg does not change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we also check before/after init_device_mesh, tags_to_pg does not change?

Good idea. Will add that to test.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we also check before/after init_device_mesh, tags_to_pg does not change?

Actually, we cannot directly compare tags_to_pg before/after init_device_mesh call, since PG cannot be pickled. I am using tags_to_pg_name for comparison instead to make sure they are the same before/after.

# pg or not, it's required that all ranks participate
# in subgroup construction
dim_group = new_group(ranks=subgroup_ranks)
# if dim_group exists for given subgroup_ranks, we re-use it.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually given that we enable this, can we also delete the _init_process_group flag in the DeviceMesh constructor? we don't need that anymore

)
dim_group_infos.append(
(_get_group_tag(dim_group), subgroup_ranks)
(_get_group_tag(dim_group), subgroup_ranks) # type: ignore[arg-type]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Curious why we have to add a type: ignore[arg-type] here instead of fixing the typing?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Curious why we have to add a type: ignore[arg-type] here instead of fixing the typing?

Ye. Good idea. Gonna be using Chip's not_none() util to make sure the typing is correct.

@wz337 wz337 force-pushed the reuse_pg_device_mesh branch 2 times, most recently from 486251c to d530b5d Compare December 18, 2023 23:06
@wz337 wz337 force-pushed the reuse_pg_device_mesh branch from d530b5d to 3b2f99a Compare January 24, 2024 20:47
@wz337 wz337 force-pushed the reuse_pg_device_mesh branch from 3b2f99a to 7f01d37 Compare January 25, 2024 06:45
@wz337
Copy link
Contributor Author

wz337 commented Jan 25, 2024

@pytorchmergebot merge

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

wz337 pushed a commit to wz337/pytorch that referenced this pull request Jan 29, 2024
…sh] Reuse sub_group pg if exists (pytorch#115716)" for otest failure

Summary:
This diff is reverting D53122872
D53122872: [DeviceMesh] Reuse sub_group pg if exists (pytorch#115716) by wz337 has been identified to be causing the following test failure:

Tests affected:
- [aps/distributed/tests:training_parallelism_core_test - aps.distributed.tests.training_parallelism_core_test.TrainingParallelismCoreTest: test_dp_validation](https://www.internalfb.com/intern/test/281475096321632/)

Here's the Multisect link:
https://www.internalfb.com/multisect/4140442
Here are the tasks that are relevant to this breakage:


We're generating a revert to back out the changes in this diff, please note the backout may land if someone accepts it.

If you believe this diff has been generated in error you may Commandeer and Abandon it.

Test Plan: NA

Differential Revision: D53138293
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ciflow/periodic Trigger jobs ran periodically on master (periodic.yml) on the PR ciflow/trunk Trigger trunk jobs on your pull request Merged module: DeviceMesh oncall: distributed Add this issue/PR to distributed oncall triage queue topic: not user facing topic category

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants