[DTensor] require DeviceMesh size equals world size #91801

Closed
wants to merge 3 commits into from

pytorch-bot bot commented Jan 6, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/91801

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 387458b:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

Contributor Author

XilunWu commented Jan 6, 2023

Two meshes are created over world in test_creat_1d_device_mesh. This PR needs further thought.

Contributor

@wanchaol wanchaol left a comment

Let's not run the check every time; do it only when we initialize world_pg.

@@ -143,6 +143,13 @@ def __init__(
f"Mesh should not be bigger than default world size, but found {self.mesh.numel()} ranks!"
)

# TODO: we will support mesh on a subset of WORLD in future
if self.mesh.numel() < world_size:
Contributor

This check should not run every time. It should run only when no default pg exists and we want to help the user create a world_pg, so I think it belongs only inside get_or_create_group.
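
A minimal sketch of that suggestion, in plain Python with no torch dependency. `FakeDist`, `get_or_create_group`, and the `launched_world_size` parameter are hypothetical stand-ins chosen for illustration, not the actual DeviceMesh or torch.distributed code; the sketch only shows the size check living on the path that creates the world process group.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FakeDist:
    """Hypothetical stand-in for process-group state (not torch.distributed)."""

    world_size: Optional[int] = None  # None means no default group exists yet

    def is_initialized(self) -> bool:
        return self.world_size is not None

    def init_process_group(self, world_size: int) -> None:
        self.world_size = world_size


def get_or_create_group(
    dist: FakeDist, mesh_ranks: List[int], launched_world_size: int
) -> int:
    """Return the world size, creating the default group if it is missing.

    The mesh-size check runs only on the creation path, so a caller who set
    up the default group themselves can still build a mesh over fewer ranks.
    """
    if not dist.is_initialized():
        # The group is being created on the user's behalf, so (in this
        # sketch) the mesh is required to cover every rank.
        if len(mesh_ranks) != launched_world_size:
            raise RuntimeError(
                f"Mesh has {len(mesh_ranks)} ranks but world size is "
                f"{launched_world_size}"
            )
        dist.init_process_group(launched_world_size)
    return dist.world_size
```

With this shape, a mesh over ranks 0 and 1 is rejected only if it would trigger creation of a 4-rank world group; once a default group already exists, the same mesh passes through unchecked.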

Contributor Author

Makes sense, because IIRC we can define, for example, mesh A on ranks 0 and 1 and mesh B on ranks 2 and 3. The example we discussed last time was actually about a mesh defined on ranks 0 and 1 with no mesh defined on ranks 2 and 3. Right?

Contributor

Yeah, it's possible to create sub-meshes; that's what 2-D does currently, so we should still allow such behavior.
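
To make the sub-mesh case concrete, here is a small sketch in plain Python. `check_sub_mesh` is a hypothetical helper, not part of DeviceMesh; it assumes the only requirements are that a mesh is no bigger than the world and uses distinct, valid ranks, which is the behavior this thread wants to keep allowing.

```python
from typing import List


def check_sub_mesh(world_size: int, ranks: List[int]) -> None:
    """Hypothetical validation: a mesh may cover any subset of WORLD."""
    if len(ranks) > world_size:
        raise ValueError(
            f"Mesh should not be bigger than world size, found {len(ranks)} ranks"
        )
    if len(set(ranks)) != len(ranks):
        raise ValueError("Mesh contains duplicate ranks")
    if any(r < 0 or r >= world_size for r in ranks):
        raise ValueError("Mesh contains a rank outside of WORLD")


world_size = 4
mesh_a = [0, 1]  # mesh A on ranks 0 and 1
mesh_b = [2, 3]  # mesh B on ranks 2 and 3

# Both sub-meshes are smaller than WORLD, yet both are accepted.
for mesh in (mesh_a, mesh_b):
    check_sub_mesh(world_size, mesh)

# The rows of a 2-D layout are sub-meshes of the same kind.
mesh_2d = [[0, 1], [2, 3]]
for row in mesh_2d:
    check_sub_mesh(world_size, row)
```

A check that required every mesh to equal the world size would reject `mesh_a` and `mesh_b` here, which is why the stricter check should not run unconditionally.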

Contributor

@wanchaol wanchaol left a comment

lgtm

Contributor Author

XilunWu commented Jan 12, 2023

@pytorchmergebot merge -g

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Jan 12, 2023
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks on your PR pass since you used the green (-g) flag (ETA: 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@XilunWu XilunWu deleted the gh/XilunWu/9/head branch April 11, 2023 21:40
Labels
ciflow/trunk Trigger trunk jobs on your pull request Merged topic: not user facing topic category

3 participants