yoyoyocmu
Contributor

Summary:
As the title says, this diff hides the contiguity requirement on the user-supplied mesh when initializing DeviceMesh.

In the current implementation, when testing with inter-node model parallelism, an exception is thrown during mesh validation when the following input is provided:

mesh = torch.arange(0, world_size).view(mp_size, dp_size).transpose(0, 1)
device_mesh = DeviceMesh(
    "cuda",
    mesh.contiguous(),
    mesh_dim_names=("dp", "mp"),
)
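
For context, `transpose` returns a non-contiguous view, which is presumably what the mesh validation trips over. A minimal sketch of that layout difference, assuming `world_size = 8`, `mp_size = 2` and `dp_size = 4` (illustrative values, not taken from this PR):

```python
import torch

# Assumed sizes for illustration only; the PR does not state concrete values.
world_size, mp_size, dp_size = 8, 2, 4

# Same construction as in the summary: reshape, then swap the two dims.
mesh = torch.arange(0, world_size).view(mp_size, dp_size).transpose(0, 1)

# transpose() returns a view over the original storage, so the strides
# no longer describe a row-major layout.
print(mesh.shape)            # torch.Size([4, 2])
print(mesh.stride())         # (1, 4)
print(mesh.is_contiguous())  # False

# contiguous() copies the data into a fresh row-major buffer.
fixed = mesh.contiguous()
print(fixed.stride())            # (2, 1)
print(fixed.is_contiguous())     # True
print(torch.equal(mesh, fixed))  # True: same ranks, different memory layout
```

The values and shape are identical in both tensors; only the memory layout differs, which is why callers previously had to remember the explicit `.contiguous()` call.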

Test Plan:
Unit Test:

buck2 test mode/dev-nosan //caffe2/test/distributed/_tensor:device_mesh -- test_validate_device_mesh

Test UI: https://www.internalfb.com/intern/testinfra/testrun/3940649876878399
Network: Up: 0B  Down: 0B
Jobs completed: 6. Time elapsed: 1:58.7s.
Tests finished: Pass 1. Fail 0. Fatal 0. Skip 0. Build failure 0

Test with MP

mesh = torch.arange(0, world_size).view(mp_size, dp_size).transpose(0, 1)
device_mesh = DeviceMesh(
    "cuda",
    mesh.contiguous(),
    mesh_dim_names=("dp", "mp"),
)

Without this change: an exception is raised.
With this change: the DeviceMesh is initialized successfully.
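
One way such a requirement could be hidden from callers is to normalize the mesh before validation. The sketch below is only an illustration: `normalize_mesh` is a hypothetical helper, not the code changed in this diff, and the sizes (8 ranks as a 2 x 4 grid) are assumed.

```python
import torch


def normalize_mesh(mesh):
    """Hypothetical helper: return the mesh as a contiguous tensor.

    Accepts either a tensor or a nested list of ranks. This is not the
    DeviceMesh implementation, only a sketch of the idea.
    """
    if not isinstance(mesh, torch.Tensor):
        mesh = torch.tensor(mesh, dtype=torch.int)
    # detach() drops any autograd history; contiguous() is a no-op when the
    # tensor is already row-major and copies otherwise.
    return mesh.detach().contiguous()


# Assumed sizes for illustration: 8 ranks arranged as mp_size=2, dp_size=4.
transposed = torch.arange(0, 8).view(2, 4).transpose(0, 1)
normalized = normalize_mesh(transposed)

print(transposed.is_contiguous())  # False
print(normalized.is_contiguous())  # True
```

With a step like this inside the constructor, a caller could pass the transposed mesh directly.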

Differential Revision: D49942839

@linux-foundation-easycla

linux-foundation-easycla bot commented Oct 5, 2023

CLA Signed

The committers listed above are authorized under a signed CLA.

@pytorch-bot

pytorch-bot bot commented Oct 5, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/110628

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (3 Unrelated Failures)

As of commit c3511c1 with merge base dac895c:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

UNSTABLE - The following jobs failed but were likely due to flakiness present on trunk and have been marked as unstable:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D49942839

@wz337 added the ciflow/trunk and ciflow/periodic labels on Oct 5, 2023
Collaborator

@wanchaol wanchaol left a comment

lgtm, thanks for the fix!

@wanchaol
Collaborator

wanchaol commented Oct 5, 2023

@yoyoyocmu can you sign the CLA?

yoyoyocmu added a commit to yoyoyocmu/pytorch that referenced this pull request Oct 5, 2023
… DeviceMesh (pytorch#110628)

@wz337
Contributor

wz337 commented Oct 5, 2023

@yoyoyocmu I think you may need to add your work email to your GitHub account as well. EasyCLA should then pick it up automatically, since you are already added as a contributor here: https://www.internalfb.com/intern/opensource/github/repo/167825833786582/contributors

@yoyoyocmu
Contributor Author

/easycla

@facebook-github-bot
Contributor

@pytorchbot merge

(Initiating merge automatically since Phabricator Diff has merged)

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Labels

ciflow/periodic, ciflow/trunk, fb-exported, Merged, module: DeviceMesh, topic: not user facing

7 participants