Hide the contiguous requirement for user input mesh when initializing DeviceMesh #110628
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/110628
Note: Links to docs will display an error until the docs builds have been completed.
✅ You can merge normally! (3 Unrelated Failures)
As of commit c3511c1 with merge base dac895c:
FLAKY - The following job failed but was likely due to flakiness present on trunk:
UNSTABLE - The following jobs failed but were likely due to flakiness present on trunk and has been marked as unstable:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
This pull request was exported from Phabricator. Differential Revision: D49942839 |
9576a40 to 33925dd
lgtm, thanks for the fix!
@yoyoyocmu can you sign the CLA?
… DeviceMesh (pytorch#110628)

Summary:
As title, this diff hides the contiguous requirement for user input mesh when initializing DeviceMesh.

In the current implementation, when testing with inter-node model parallelism, an exception is thrown during mesh validation when the following input is provided:
```
mesh = torch.arange(0, world_size).view(mp_size, dp_size).transpose(0, 1)
device_mesh = DeviceMesh(
    "cuda",
    mesh.contiguous(),
    mesh_dim_names=("dp", "mp")
)
```

Test Plan:
**Unit Test**:
```
buck2 test mode/dev-nosan //caffe2/test/distributed/_tensor:device_mesh -- test_validate_device_mesh
Test UI: https://www.internalfb.com/intern/testinfra/testrun/3940649876878399
Network: Up: 0B Down: 0B
Jobs completed: 6. Time elapsed: 1:58.7s.
Tests finished: Pass 1. Fail 0. Fatal 0. Skip 0. Build failure 0
```

**Test with MP**
```
mesh = torch.arange(0, world_size).view(mp_size, dp_size).transpose(0, 1)
device_mesh = DeviceMesh(
    "cuda",
    mesh.contiguous(),
    mesh_dim_names=("dp", "mp")
)
```
Without the change: exception.
After this change: initialized successfully.

Differential Revision: D49942839
33925dd to 05fcecc
@yoyoyocmu I think you may need to add your work email to your GitHub account as well. Then easyCLA should pick it up automatically, since you are already added as a contributor here: https://www.internalfb.com/intern/opensource/github/repo/167825833786582/contributors
/easycla
05fcecc to c3511c1
@pytorchbot merge (Initiating merge automatically since Phabricator Diff has merged)
Merge started
Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team
Summary:
As title, this diff hides the contiguous requirement for user input mesh when initializing DeviceMesh.
In the current implementation, when testing with inter-node model parallelism, an exception is thrown during mesh validation when the following input is provided:
Test Plan:
Unit Test:
Test with MP
Without the change: exception.
After this change: initialized successfully.
Differential Revision: D49942839
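
The summary above does not show the validation code itself, so the following is only a minimal sketch of why a mesh built with `view(...).transpose(0, 1)` is not contiguous until `.contiguous()` is called. It assumes the exception is related to the transposed tensor's memory layout, as the PR title suggests. The helpers `row_major_strides`, `is_contiguous`, and `transposed_mesh` are hypothetical pure-Python stand-ins for illustration, not PyTorch or DeviceMesh APIs, and the sizes (`world_size=8`, `mp_size=2`, `dp_size=4`) are example values.

```python
def row_major_strides(shape):
    """Strides (in elements) of a freshly allocated row-major tensor."""
    strides, step = [], 1
    for size in reversed(shape):
        strides.append(step)
        step *= size
    return tuple(reversed(strides))


def is_contiguous(shape, strides):
    """True when the strides match the row-major layout for this shape.

    Dimensions of size 1 are ignored, since their stride never matters.
    """
    expected = row_major_strides(shape)
    return all(size == 1 or s == e for size, s, e in zip(shape, strides, expected))


def transposed_mesh(world_size, mp_size, dp_size):
    """Nested-list equivalent of arange(world_size).view(mp_size, dp_size).transpose(0, 1)."""
    assert world_size == mp_size * dp_size
    # Element [m][d] of the (mp_size, dp_size) view is m * dp_size + d;
    # transposing swaps the two indices.
    return [[m * dp_size + d for m in range(mp_size)] for d in range(dp_size)]


# Example sizes (assumed for illustration only).
world_size, mp_size, dp_size = 8, 2, 4

view_shape = (mp_size, dp_size)
view_strides = row_major_strides(view_shape)

# A transpose swaps shape and strides without moving any data.
t_shape, t_strides = view_shape[::-1], view_strides[::-1]

print(is_contiguous(view_shape, view_strides))  # layout of the original view
print(is_contiguous(t_shape, t_strides))        # layout after transpose(0, 1)
print(transposed_mesh(world_size, mp_size, dp_size))
```

In this sketch the transposed layout keeps the original strides in swapped order, so it no longer matches the row-major strides for its new shape; copying the data into a fresh row-major buffer (what `.contiguous()` does in PyTorch) restores the match.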