[DeviceMesh] Clarifying flatten use case #161311

fduwjj · 2025-08-22T21:12:28Z

Stack from ghstack (oldest at bottom):

-> [DeviceMesh] Clarifying flatten use case #161311

While refactoring and simplifying the bookkeeping for device mesh, we found an interesting bug in the DeviceMesh flatten implementation. Here are the findings:

1. In the unit test, we assume users can call `dp_cp_mesh._flatten()` many times without a new backend being created each time (i.e. the result is cached).
2. From the implementation of slicing, we actually raise an exception when `_flatten` is called more than once. That check has a bug which was partially fixed in "Do not incorrectly chain each of the strings as iterables" #160709, but that fix does not cover the case where `_flatten` is called twice.

The more important question is what behavior we want for `_flatten`. Should we allow calling `_flatten` multiple times (with the same mesh_name)? I think we should, for two reasons:

1. We allow slicing with the same mesh_name or name_list multiple times, and we cache the underlying PG. Although we return a new device mesh object every time, they all compare equal (according to `__eq__`).
2. We already cache the flattened mesh inside `root_to_flatten_mapping` and return early on a cache hit, but that line is never reached if we error out before it.

We should also allow flattening a 1D mesh into its own mesh_dim_name as a no-op; I added a unit test for it.
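
To make the intended behavior concrete, here is a minimal sketch in plain Python. `ToyMesh`, `flatten_cache`, and `groups_created` are hypothetical names used only for illustration; this is not the DeviceMesh implementation, and `flatten_cache` merely plays the role that `root_to_flatten_mapping` plays in the description above. The sketch assumes the cache lookup runs before the name-validity check, so a repeated flatten returns the cached mesh instead of raising. Flattening a 1D mesh under a different name is not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class ToyMesh:
    """Hypothetical stand-in for a device mesh; NOT the PyTorch class."""

    ranks: tuple[int, ...]
    dim_names: tuple[str, ...]
    # Plays the role of root_to_flatten_mapping: flattened name -> cached mesh.
    flatten_cache: dict[str, "ToyMesh"] = field(default_factory=dict, compare=False)
    # Counts how many (pretend) process groups were created.
    groups_created: int = field(default=0, compare=False)

    def flatten(self, name: str) -> "ToyMesh":
        # No-op: a 1D mesh flattened into its own dim name returns itself.
        if len(self.dim_names) == 1 and self.dim_names[0] == name:
            return self
        # Early return on a cache hit, checked BEFORE any validation
        # so that a second flatten with the same name does not raise.
        cached = self.flatten_cache.get(name)
        if cached is not None:
            return cached
        # A new flattened name must not collide with an existing dim name.
        if name in self.dim_names:
            raise ValueError(f"{name!r} is already a mesh dim name")
        self.groups_created += 1  # stands in for creating a process group
        flat = ToyMesh(self.ranks, (name,))
        self.flatten_cache[name] = flat
        return flat


root = ToyMesh(tuple(range(8)), ("dp", "cp"))
first = root.flatten("dp_cp")
second = root.flatten("dp_cp")  # served from the cache, no new group
```

In this toy model the second call reuses the cached object, so `first == second` holds and `root.groups_created` stays at 1.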

cc @H-Huang @awgu @wanchaol @fegin @wz337 @wconstab @d4l3k @pragupta @ezyang @msaroufim @dcci

[ghstack-poisoned]

pytorch-bot · 2025-08-22T21:12:32Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/161311


✅ No Failures

As of commit ffb69f4 with merge base 5babb4d ():
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

ghstack-source-id: 208f663 Pull Request resolved: #161311

torch/distributed/device_mesh.py

test/distributed/test_device_mesh.py

fegin · 2025-08-22T21:34:32Z

I really thought we allowed flattening twice. If we already cache the flattened meshes, there should be no harm in allowing flattening twice.


fegin · 2025-08-23T04:39:32Z

torch/distributed/device_mesh.py

+                device_mesh.mesh_dim_names
+            ):
+                return device_mesh
+


Hmm, what if users flatten a 1D mesh but with a different name? What behavior should we expect?

fduwjj · 2025-08-23

Then we throw an exception, since that is an invalid use, right? That's why we have the second check after the `and`.

wconstab · 2025-08-23

Thinking about the case where users flattened the (2,4) mesh and somewhere else flattened the (4,2) mesh: in their mind there might not be a clear relationship between the two (8,) meshes they created. Perhaps they want to use different names for them, matching the logical use case for each mesh. However, we should still allow the flatten and reuse the PG if possible. Maybe this is not a common case, but I also don't see why we need to error on flattening twice with different names. That's my thinking anyway.

fduwjj · 2025-08-26

@wconstab I see, but that is outside the scope of this PR. For this one I just want to fix the current behavior for now.
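
As a rough illustration of the (2,4) versus (4,2) case raised above, the sketch below uses plain Python lists rather than DeviceMesh. It assumes ranks are laid out in row-major order starting at 0; `flatten_ranks`, `pg_cache`, and `get_or_create_group` are hypothetical helpers, and a string stands in for a process group. Under that assumption both layouts flatten to the same rank tuple, so a cache keyed on ranks rather than on the name could hand back the same group.

```python
def flatten_ranks(shape: tuple[int, int]) -> tuple[int, ...]:
    """Build a row-major 2D mesh of ranks and flatten it to 1D."""
    rows, cols = shape
    mesh = [[r * cols + c for c in range(cols)] for r in range(rows)]
    return tuple(rank for row in mesh for rank in row)


# Hypothetical cache keyed by the flattened ranks, not by mesh name.
pg_cache: dict[tuple[int, ...], str] = {}


def get_or_create_group(ranks: tuple[int, ...]) -> str:
    """Return the cached stand-in group for these ranks, creating it once."""
    if ranks not in pg_cache:
        pg_cache[ranks] = f"group_{len(pg_cache)}"
    return pg_cache[ranks]


group_a = get_or_create_group(flatten_ranks((2, 4)))  # e.g. named "dp_cp"
group_b = get_or_create_group(flatten_ranks((4, 2)))  # e.g. named "all"
```

Here `group_a` and `group_b` refer to the same cached entry even though the two flattened meshes could carry different names.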


fduwjj · 2025-09-10T05:01:14Z

@pytorchbot merge

pytorchmergebot · 2025-09-10T05:03:04Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here

Pull Request resolved: pytorch#161311 Approved by: https://github.com/fegin

[DeviceMesh] Clarifying flatten use case

1bdfed4

[ghstack-poisoned]

fduwjj added a commit that referenced this pull request Aug 22, 2025

[DeviceMesh] Clarifying flatten use case

cd27f25

ghstack-source-id: 208f663 Pull Request resolved: #161311

pytorch-bot bot added the oncall: distributed label Aug 22, 2025

fduwjj added DeviceMesh release notes: DeviceMesh labels Aug 22, 2025

fduwjj commented Aug 22, 2025

torch/distributed/device_mesh.py

fduwjj commented Aug 22, 2025

test/distributed/test_device_mesh.py

fduwjj requested review from fegin, wanchaol, wconstab and wz337 August 22, 2025 21:25

fduwjj added a commit that referenced this pull request Aug 22, 2025

[DeviceMesh] Clarifying flatten use case

17363a9

ghstack-source-id: 24ae4ed Pull Request resolved: #161311

fduwjj added a commit that referenced this pull request Aug 22, 2025

[DeviceMesh] Clarifying flatten use case

87fad1b

ghstack-source-id: 65a1c89 Pull Request resolved: #161311

wconstab reviewed Aug 22, 2025

test/distributed/test_device_mesh.py

fduwjj added a commit that referenced this pull request Aug 23, 2025

[DeviceMesh] Clarifying flatten use case

c3ba67f

ghstack-source-id: f48611f Pull Request resolved: #161311

fduwjj requested a review from wconstab August 23, 2025 02:30

fduwjj added the ciflow/trunk label Aug 23, 2025

fduwjj added a commit that referenced this pull request Aug 23, 2025

[DeviceMesh] Clarifying flatten use case

89494f0

ghstack-source-id: 852444b Pull Request resolved: #161311

fegin reviewed Aug 23, 2025

fegin approved these changes Aug 26, 2025

albanD removed the DeviceMesh label Sep 2, 2025

fduwjj added a commit that referenced this pull request Sep 10, 2025

[DeviceMesh] Clarifying flatten use case

2b51d23

ghstack-source-id: e2c998f Pull Request resolved: #161311

pytorchmergebot added the merging label Sep 10, 2025

pytorchmergebot added the Merged label Sep 10, 2025

pytorchmergebot closed this in be8095b Sep 10, 2025

pytorchmergebot removed the merging label Sep 10, 2025

github-actions bot deleted the gh/fduwjj/191/head branch October 11, 2025 02:07
