[DeviceMesh] Reuse sub_group pg if exists #115716
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/115716
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 7f01d37 with merge base c0732c8.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Nice! looks great! one comment about testing
# 2) tag for world size pg
# 3) tag for tp pg
# 4) tag for dp pg
self.assertEqual(len(tags_to_pg), 4)
Can we also check that `tags_to_pg` does not change before/after `init_device_mesh`?
Good idea. Will add that to test.
Actually, we cannot directly compare `tags_to_pg` before/after the `init_device_mesh` call, since PG cannot be pickled. I am using `tags_to_pg_name` for comparison instead to make sure they are the same before/after.
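For readers following along, here is a minimal sketch of the "compare by name, not by handle" idea described above. It does not use `torch.distributed`; `FakeProcessGroup`, `snapshot_names`, and the `registry` layout are hypothetical stand-ins chosen for illustration, and the lock attribute only exists to make the stand-in unpicklable.

```python
import pickle
import threading


class FakeProcessGroup:
    """Hypothetical stand-in for a process group handle (not a PyTorch class)."""

    def __init__(self, name):
        self.name = name
        # A lock makes the object unpicklable, mimicking an opaque handle.
        self._lock = threading.Lock()


def snapshot_names(tags_to_groups):
    """Return a plain, comparable copy that maps each tag to sorted group names."""
    return {
        tag: sorted(group.name for group in groups)
        for tag, groups in tags_to_groups.items()
    }


# Assumed layout: tag -> list of group handles.
registry = {"": [FakeProcessGroup("world")], "tp": [FakeProcessGroup("tp_0")]}

# The handles themselves cannot be pickled, so the mapping cannot be copied that way.
try:
    pickle.dumps(registry)
    picklable = True
except (TypeError, pickle.PicklingError):
    picklable = False

before = snapshot_names(registry)
# ... the call under test would run here and must not register new groups ...
after = snapshot_names(registry)
assert before == after
```

Because the snapshot holds only strings, it can be stored and compared with `==` regardless of how the underlying handles behave.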
# pg or not, it's required that all ranks participate
# in subgroup construction
dim_group = new_group(ranks=subgroup_ranks)
# if dim_group exists for given subgroup_ranks, we re-use it.
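The reuse idea in the hunk above can be sketched as a cache keyed by the rank list. This is only an illustration under stated assumptions, not the code in `torch/distributed/device_mesh.py`: `GroupCache` and `fake_new_group` are hypothetical names, and the factory stands in for a real group constructor.

```python
from typing import Callable, Dict, Sequence, Tuple


class GroupCache:
    """Hypothetical cache that returns an existing group for a given rank list."""

    def __init__(self, create_group: Callable[[Tuple[int, ...]], object]) -> None:
        self._create_group = create_group
        self._groups: Dict[Tuple[int, ...], object] = {}
        self.created = 0  # number of times the factory was actually called

    def get_or_create(self, ranks: Sequence[int]) -> object:
        key = tuple(ranks)
        group = self._groups.get(key)
        if group is None:
            # No group exists for these ranks yet, so build and remember one.
            group = self._create_group(key)
            self._groups[key] = group
            self.created += 1
        return group


def fake_new_group(ranks: Tuple[int, ...]) -> object:
    """Stand-in factory; a real one would construct a communication group."""
    return {"ranks": ranks}


cache = GroupCache(fake_new_group)
first = cache.get_or_create([0, 1])
second = cache.get_or_create([0, 1])  # same ranks, so the first group is reused
other = cache.get_or_create([2, 3])   # different ranks, so a new group is built
```

In a real collective setting, every rank would need to take the same branch for a given rank list, since the hunk's comment notes that all ranks must participate in subgroup construction.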
Actually, given that we enable this, can we also delete the `_init_process_group` flag in the `DeviceMesh` constructor? We don't need that anymore.
torch/distributed/device_mesh.py
Outdated
  )
  dim_group_infos.append(
-     (_get_group_tag(dim_group), subgroup_ranks)
+     (_get_group_tag(dim_group), subgroup_ranks)  # type: ignore[arg-type]
nit: Curious why we have to add a `type: ignore[arg-type]` here instead of fixing the typing?
Yeah, good idea. Gonna be using Chip's `not_none()` util to make sure the typing is correct.
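For context, a helper of this kind typically narrows `Optional[T]` to `T` so that the type checker accepts the call without a `type: ignore`. The version below is a hypothetical reimplementation written for illustration; the actual util referenced above may differ in name, location, and error handling, and `describe_tag` is an invented function used only to show the narrowing.

```python
from typing import Optional, TypeVar

T = TypeVar("T")


def not_none(value: Optional[T]) -> T:
    """Return value unchanged, raising if it is None (narrows Optional[T] to T)."""
    if value is None:
        raise TypeError("expected a value, got None")
    return value


def describe_tag(tag: str) -> str:
    """Hypothetical function that requires a non-optional string."""
    return "tag=" + tag


maybe_tag: Optional[str] = "ptd:0"
# Without not_none(), a type checker would flag passing Optional[str] as str.
described = describe_tag(not_none(maybe_tag))
```

The runtime check also turns a silent `None` into an immediate error at the call site, which a `type: ignore` comment would not do.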
@pytorchmergebot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team
…sh] Reuse sub_group pg if exists (pytorch#115716)" for otest failure

Summary:
This diff is reverting D53122872.
D53122872: [DeviceMesh] Reuse sub_group pg if exists (pytorch#115716) by wz337 has been identified to be causing the following test failure:

Tests affected:
- [aps/distributed/tests:training_parallelism_core_test - aps.distributed.tests.training_parallelism_core_test.TrainingParallelismCoreTest: test_dp_validation](https://www.internalfb.com/intern/test/281475096321632/)

Here's the Multisect link: https://www.internalfb.com/multisect/4140442
Here are the tasks that are relevant to this breakage:

We're generating a revert to back out the changes in this diff, please note the backout may land if someone accepts it.
If you believe this diff has been generated in error you may Commandeer and Abandon it.

Test Plan: NA

Differential Revision: D53138293
Currently, we create new_group for sub_group pg during mesh initialization. The PR changes this so we will:
cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @tianyu-l @wconstab @yf225