
Conversation

@lw (Contributor) commented Jul 29, 2025

[ghstack-poisoned]
@pytorch-bot bot added the oncall: distributed label (Add this issue/PR to distributed oncall triage queue) on Jul 29, 2025
pytorch-bot bot commented Jul 29, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/159371

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

❌ 2 New Failures, 1 Unrelated Failure

As of commit 9927ca7 with merge base 908c5cc:

NEW FAILURES - The following jobs have failed:

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

  • pull / linux-jammy-py3_9-clang9-xla / test (xla, 1, 1, linux.12xlarge, unstable) (gh) (#158876)
    /var/lib/jenkins/workspace/xla/torch_xla/csrc/runtime/BUILD:476:14: Compiling torch_xla/csrc/runtime/xla_util_test.cpp failed: (Exit 1): gcc failed: error executing CppCompile command (from target //torch_xla/csrc/runtime:xla_util_test) /usr/bin/gcc -U_FORTIFY_SOURCE -fstack-protector -Wall -Wunused-but-set-parameter -Wno-free-nonheap-object -fno-omit-frame-pointer -g0 -O2 '-D_FORTIFY_SOURCE=1' -DNDEBUG -ffunction-sections ... (remaining 229 arguments skipped)

This comment was automatically generated by Dr. CI and updates every 15 minutes.

lw added a commit that referenced this pull request Jul 29, 2025
[ghstack-poisoned]
lw added a commit that referenced this pull request Jul 29, 2025
@wconstab (Contributor) left a comment

LGTM! Though it would be nice to also test that the options get passed correctly. (Or maybe this is covered by existing tests?)

@fduwjj (Contributor) left a comment

Thanks for the fix; I left some comments. Stamping to unblock.

# Inherit from the parent group if no options are specified for the group.
if dim in _mesh_resources.mesh_dim_group_options:
    if pg_backend_and_options[dim] != (None, None):
        raise RuntimeError(
Contributor:
Hmm, maybe we can let pg_backend_and_options override what the mesh has already set? No?

Contributor Author:
I didn't want to worry right now about which of the two gets priority if they're both specified, so I decided to throw in such cases. Do you have an argument for why the argument should take precedence over the global?

Contributor:
Well, I think the argument is that if the user passed in an option explicitly, that means it should override the existing one?

Comment on lines 1068 to 1069
if mesh_dim_name in pg_backend_and_options:
    if mesh_dim_idx in pg_backend_and_options:
Contributor:
nit: you might want to consolidate these two into one if?

Contributor Author:
Either we do (like now):

if a:
    if b:
        raise
    do()

or (as you suggest)

if a and b:
    raise
if a:
    do()

I prefer the first one since it doesn't repeat a twice.

mesh_shape: tuple[int, ...],
*,
mesh_dim_names: Optional[tuple[str, ...]] = None,
pg_backend_and_options: Optional[
Contributor:
Can you kindly add a comment at L1040 saying that the key here can be a dim or a dim name?

Comment on lines 1078 to 1089
if pg_backend_and_options is not None:
    pg_backend_and_options_tuple = tuple(
        pg_backend_and_options.pop(i, (None, None))
        for i in range(len(mesh_shape))
    )
    if pg_backend_and_options:
        raise RuntimeError(
            f"Found invalid keys in pg_backend_and_options: got {list(pg_backend_and_options.keys())}, "
            f"expected integers in range [0, {len(mesh_shape)}) or one of {mesh_dim_names or []}"
        )
else:
    pg_backend_and_options_tuple = None
Contributor:
any chance we can move this logic into the previous for-loop? Looks like we are looping the mesh twice?

Contributor Author:
The previous for-loop is only entered if mesh_dim_names is not None. So we could merge this into a single loop, but that loop would then have to do redundant checks on mesh_dim_names, which would be redundant in a different way.

@wanchaol (Collaborator) left a comment

Thanks for adding the feature! I have some suggestions on how the UX should look; requesting changes to align on the UX.

My opinion is that we should separate the backend selection and the PG options instead of blending them together; this would make the argument more concise. Here is the proposed UX:

# Switching between different backends is controlled by the device_type argument; this is aligned with how process group backend selection works.
init_device_mesh("cuda:nccl", ...)
# Backend options can be passed in via something like:
init_device_mesh("cuda:nccl", mesh_shape, mesh_dim_names=("dp", "tp"), mesh_dim_options={"dp": nccl_pg_options})

Basically, mesh_dim_options should be of type Dict[str, C10dBackend.Options]. I think we could support only str keys, but accepting int as well also sounds good.

mesh_shape: tuple[int, ...],
*,
mesh_dim_names: Optional[tuple[str, ...]] = None,
pg_backend_and_options: Optional[
Collaborator:
hmmm I would prefer to:

  1. Separate the PG backend and the PG options; these should be controlled separately, IMO.
  2. Name this mesh_dim_options instead.

Contributor Author:
I find it useless to prepend mesh_dim_: this is the function to create a device mesh, it's quite implicit!

On the other hand, I want to explicitly mention pg_ or group_ or something to clarify it's not the options of the mesh dim itself, but of the underlying ProcessGroup.

Collaborator:
I find it useless to prepend mesh_dim_: this is the function to create a device mesh, it's quite implicit!

Sure, that makes sense! I was originally thinking to align it with mesh_dim_names, but I agree it's redundant. Shall we simply call it options? Maybe we should later change mesh_dim_names to dim_names or simply names.

On the other hand, I want to explicitly mention pg_ or group_ or something to clarify it's not the options of the mesh dim itself, but of the underlying ProcessGroup.

I wonder what you mean by the options of the mesh dim itself? I thought there's nothing that can be configured (or has options) on a mesh dim except its underlying communicator, so it would be quite explicit to the user? I am OK with group_options, but if there's no confusion about options, I think we should just go with options.

@lw (Contributor Author) commented Jul 30, 2025

I'm not sure I agree with the comments on UX.

  • True, this might add a burden if one wants to specify only the backend type or only the options, but a given options object only works for one certain backend type (e.g., ProcessGroupNCCL.Options(...) requires "nccl" as backend type), thus it doesn't seem that burdensome to ask the user to be explicit about it. Similarly, if one wants to specify only the backend type, adding None as a second argument doesn't seem too hard.
  • However, your proposal makes it much more verbose and redundant if one wants to give both a backend type and an options object. Currently it would be pg_backend_and_options={"tp": ("nccl", opts)}; with your proposal it would be pg_backend={"tp": "nccl"}, pg_options={"tp": opts}, thus increasing the risk of errors if the two get out of sync.

What I can propose is that we allow the values of the dict to also be simple strs or simple objects of (a subtype of) C10dBackend.Options, and if so we automatically turn them into a tuple by autofilling the other half with None.
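A minimal sketch of that normalization (the helper name is hypothetical, and stricter type checking is omitted):

# Hypothetical helper: accept a (backend, options) tuple, a bare backend
# string, or a bare options object, and normalize to a (backend, options) pair.
def _normalize_backend_and_options(spec):
    if isinstance(spec, tuple):
        # Already a (backend, options) pair; pass it through unchanged.
        return spec
    if isinstance(spec, str):
        # Only the backend type was given; leave the options half as None.
        return spec, None
    # Otherwise assume it is a C10dBackend.Options object; leave the backend as None.
    return None, spec


# All of these normalize to a (backend, options) pair.
assert _normalize_backend_and_options(("nccl", None)) == ("nccl", None)
assert _normalize_backend_and_options("nccl") == ("nccl", None)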

@wanchaol (Collaborator) commented Jul 30, 2025

  • True, this might add a burden if one wants to specify only the backend type or only the options, but a given options object only works for one certain backend type (e.g., ProcessGroupNCCL.Options(...) requires "nccl" as backend type), thus it doesn't seem that burdensome to ask the user to be explicit about it. Similarly, if one wants to specify only the backend type, adding None as a second argument doesn't seem too hard.

I think the problem with blending the concepts of backend and PG options is that many times users actually only want to specify the backend, or only want to specify the options; there's very rarely a case where users want to specify them together. So IMO we should try to make the common-case experience better: what if the user just passes PG options and doesn't want to specify the backend? And when specifying options for a multi-dim mesh, the backend needs to be specified again and again.

  • However, your proposal makes it much more verbose and redundant if one wants to give both a backend type and an options object. Currently it would be pg_backend_and_options={"tp": ("nccl", opts)}; with your proposal it would be pg_backend={"tp": "nccl"}, pg_options={"tp": opts}, thus increasing the risk of errors if the two get out of sync.

I wonder how it makes things much more verbose if one wants to give both a backend type and options? My proposal is to not add a separate pg_backend argument, but rather to encode it directly in device_type; this is similar to how the process group backend specification is done. I think it is actually simpler than the current UX. The proposed UX would be something like this:

init_device_mesh("cuda:nccl", mesh_shape, options={"tp": opts})

In current way it would be:

init_device_mesh("cuda", mesh_shape, pg_backend_and_options={"tp": ("nccl", opts)})

I think it becomes even simpler if the user wants to specify PG options for multiple dimensions of the device mesh; the proposed UX would be:

init_device_mesh(
    "cuda:nccl",
    mesh_shape,
    options={"tp": tp_opts, "dp": dp_opts, "cp": cp_opts}
)

In the current way it would be:

init_device_mesh(
    "cuda",
    mesh_shape,
    pg_backend_and_options={"tp": ("nccl", tp_opts), "dp": ("nccl", dp_opts), "cp": ("nccl", cp_opts)}
)

It seems to me that repeating the "nccl" backend, along with the nested dict, looks cumbersome. There should not be an out-of-sync issue, as the backend is parsed from device_type and we can easily guard against it.

@wconstab (Contributor) commented:

Here's a clarifying question: how important is it to support a different backend type per dimension? If it is important, then this version, init_device_mesh("cuda:nccl", ...), is limiting, because it implies one backend type shared by all the mesh dims. I'm not sure whether @lw wants this, but I think it's getting more common (e.g., torchft) to want a special type of backend for just one dimension.

@wanchaol (Collaborator) commented:

How important is it to support a different backend type per dimension? If it is important, then this version, init_device_mesh("cuda:nccl", ...), is limiting, because it implies one backend type shared by all the mesh dims. I'm not sure whether @lw wants this, but I think it's getting more common (e.g., torchft) to want a special type of backend for just one dimension.

@wconstab I am not sure exactly how torchft wants to change the backend for each dimension, and I think it's not even clear to me that having a different backend on each dimension makes sense: the way to create a subgroup is to inherit the backend of the world group (and to rely on the backend's way of splitting the group, i.e., ncclSplit), and many DeviceMesh operations today (flatten, the upcoming split) require the same backend, otherwise the operation is very much undefined.

@wconstab (Contributor) commented:

I think we'd better not assume the world group is materialized. It may be important to have it for some jobs, but it is also too expensive to create a world group in larger-scale jobs. So I prefer the direction of decoupling the top-level world from the first layer of dims, and in this case I think it can make sense to allow separate dims to use different backends.

But I also agree the semantics of split should have a defined meaning corresponding to a PG split.

@wanchaol (Collaborator) commented:

I think we'd better not assume the world group is materialized. It may be important to have it for some jobs, but it is also too expensive to create a world group in larger-scale jobs. So I prefer the direction of decoupling the top-level world from the first layer of dims, and in this case I think it can make sense to allow separate dims to use different backends.

Yeah sure, I am not assuming the world group is materialized, and I think it's probably not a good idea to even materialize the world PG at all for a multi-dim mesh. But even if we don't materialize the world PG, we should avoid undefined behaviors as much as possible (i.e., what does flattening two dimensions with completely different backends even mean?). Once a real communicator is formed, all the other connections in the same device mesh should use the same backend; otherwise it's not really a "device mesh", since the devices are not connected together via the same type of network communicator.

If someone really wants a separate backend for certain ranks/dims, IMO they can either manage it as a separate process group by hand, or manage a separate DeviceMesh created directly via the DeviceMesh constructor (instead of init_device_mesh).

@lw (Contributor Author) commented Jul 31, 2025

  • There are use cases where different dims should use different backends (e.g., intra- and inter-node)
  • The root PG might not be "materialized", but we must count on it having a backend type and some options that we can copy

I'm gonna make a call in order to close this discussion, which is to proceed as follows: I'll rename the argument to pg_override and make it accept a (type, opts) tuple, but also just the type (str) or just the options (C10dBackend.Options). I believe this should make everyone happy.
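For illustration, calls under that scheme could look roughly like this (the dim names, the NCCL options object, and the exact key/value combinations are made up; a CUDA/NCCL build is assumed):

import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh

opts = dist.ProcessGroupNCCL.Options()  # any C10dBackend.Options subclass works here
opts.is_high_priority_stream = True

mesh = init_device_mesh(
    "cuda",
    (2, 2, 2),
    mesh_dim_names=("dp", "tp", "cp"),
    pg_override={
        "dp": "nccl",          # backend type only
        "tp": opts,            # options only, backend stays the default
        "cp": ("nccl", opts),  # both backend type and options
    },
)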

[ghstack-poisoned]
lw added a commit that referenced this pull request Jul 31, 2025
Comment on lines +23 to +27
struct Options : Backend::Options {
  explicit Options() : Backend::Options("fake") {}

  int fake_option = 0;
};
Contributor Author:
Needed to define options for the fake PG in order to test the options-only override

Contributor:
I guess this is fine. Maybe later we can replace it with a real option.


def init_pg(self, eager_init) -> None:
if "nccl" in self.backend and torch.cuda.device_count() < self.world_size:
def init_pg(self, eager_init, backend: Optional[str] = None) -> None:
Contributor Author:
Had to modify this (and below in this file) in order to allow using the "fake" backend as the root PG



def _create_fake_pg(prefix_store, rank, world_size, timeout):
def _create_fake_pg(common_opts, backend_opts):
Contributor Author:
Switched to using the "extended API" as otherwise the options wouldn't get passed to the FakePG constructor
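For reference, a registration using the extended API looks roughly like this; the backend name, the attribute names on common_opts, and the device list are assumptions for illustration, not the PR's actual test code:

import torch.distributed as dist

def _create_my_fake_pg(common_opts, backend_opts):
    # With extended_api=True the creator receives a bundle of common options
    # plus the backend-specific options object, which is what lets per-dim
    # Options reach the backend constructor. Attribute names are an assumption.
    store = common_opts.store
    rank = common_opts.group_rank
    world_size = common_opts.group_size
    raise NotImplementedError("illustrative sketch only")

dist.Backend.register_backend(
    "my_fake",          # hypothetical backend name
    _create_my_fake_pg,
    extended_api=True,  # opt into the (common_opts, backend_opts) signature
    devices=["cpu"],
)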

Contributor:
I am wondering if this will cause any BC breakage for some tests?

[ghstack-poisoned]
@lw (Contributor Author) commented Jul 31, 2025

I've added support for the pg override to _flatten as well
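A rough sketch of that usage; the pg_override keyword on _flatten mirrors the init_device_mesh argument and is an assumption here, as is the specific options object:

import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh

mesh = init_device_mesh("cuda", (2, 2, 2), mesh_dim_names=("dp", "cp", "tp"))

# Flatten "dp" and "cp" into one dim, overriding the backend/options used for
# the flattened group's ProcessGroup.
flat_opts = dist.ProcessGroupNCCL.Options()
dp_cp_mesh = mesh["dp", "cp"]._flatten("dp_cp", pg_override=("nccl", flat_opts))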

lw added a commit that referenced this pull request Jul 31, 2025
self.device_type,
(2, 2, 2),
mesh_dim_names=("dp", "tp", "cp"),
pg_override={0: "fake", 2: ("fake", opts)},
Collaborator:
So what if a user creates a 4D or 5D device mesh, and they want to use their custom backend (e.g., even the fake backend tested here), but they don't need to customize the options per device mesh dimension? This is pretty common for non-CUDA backend vendors. With the current API the user would need to do something like this:

mesh = init_device_mesh(
    self.device_type,
    (2, 2, 2, 2, 2),
    mesh_dim_names=("pp", "dp", "tp", "cp", "ep"),
    pg_override={0: "fake", 1: "fake", 2: "fake", 3: "fake", 4: "fake"},
)

As a user, I feel this is still very cumbersome; but if I don't do this, some of the mesh dimensions would just silently use the "default" backend for the current device type.

@wanchaol (Collaborator) commented:

I'm gonna make a call in order to close this discussion, which is to proceed as follows: I'll rename the argument to pg_override and make it accept a (type, opts) tuple, but also just the type (str) or just the options (C10dBackend.Options). I believe this should make everyone happy.

@lw Can we start by exposing this argument as a private API (both in init_device_mesh and flatten)? I don't want to block any of your work, and I feel it would be good to listen to user feedback about the UX first (just like we did for the _flatten API); if the community feels it's good, then let's make it public.

@lw (Contributor Author) commented Aug 1, 2025

What if a user creates a 4D or 5D device mesh, and they want to use their custom backend (e.g., even the fake backend tested here), but they don't need to customize the options per device mesh dimension?

I don't understand whether this is a use case you're hitting in practice or whether you're imagining corner cases where the API would be awkward to use. The new API addresses what you asked for earlier (one can provide just the options if one so chooses), but now you're evoking a case where one does want to change the backend?

In order to pragmatically address your example use case, I think the user should just use fake as their primary backend when calling init_process_group, and then they wouldn't have to pass any override. If for whatever reason they want the device mesh to use a different root/default backend than c10d's, we could envision adding a new optional backend=... argument to init_device_mesh, which acts as a default for all dimensions. I suggest we only do that once we have a compelling use case.
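Concretely, something along these lines (the backend name and sizes are made up for illustration, the backend is assumed to already be registered for the device type, and the usual env:// rendezvous variables are assumed to be set):

import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh

# Make the custom backend the default for the whole job; every mesh dimension
# then picks it up automatically, with no per-dimension override needed.
dist.init_process_group(backend="my_fake", rank=0, world_size=8)
mesh = init_device_mesh("cpu", (2, 4), mesh_dim_names=("dp", "tp"))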

Note also that no API in PyTorch is truly private. No matter whether there's a leading underscore, if that's the only way to achieve what people need, they will use it in their downstream project and PyTorch won't be able to remove it or break it. Case in point: the _flatten method is supposedly private but we're using it in our codebase and we'd be quite annoyed if it changed.

Do you have a concrete proposal that addresses your concerns? If not, I'll probably merge this PR as-is.

@wanchaol (Collaborator) commented Aug 1, 2025

I don't understand whether this is a use case you're hitting in practice or whether you're imagining corner cases where the API would be awkward to use. The new API addresses what you asked for earlier (one can provide just the options if one so chooses), but now you're evoking a case where one does want to change the backend?

@lw To be clear, my original comment is about both: "users actually only want to specify the backend, or only want to specify the options". The new API only addresses one of the asks, so I am not sure why you think I'm evoking a new case; I am just discussing the existing ones. The case where the user only specifies the backend is pretty common, e.g., for HCCL users, or for any users who implemented a custom process group and registered it as a backend (we have one, so it's something we would hit with the current API).

Note also that no API in PyTorch is truly private. No matter whether there's a leading underscore, if that's the only way to achieve what people need, they will use it in their downstream project and PyTorch won't be able to remove it or break it. Case in point: the _flatten method is supposedly private but we're using it in our codebase and we'd be quite annoyed if it changed.

The private API is mostly for developers to have the freedom to adapt/change the API depending on new use cases/feedback before making it stable; otherwise every BC-breaking change needs to go through 3 release cycles. It's also a signal to users that PyTorch can remove/break the API, so use at your own risk. If we didn't have private APIs then we wouldn't need prototype features; everything could be public and developed/used in prod, and I don't think PyTorch wants to be that way.

Do you have a concrete proposal that addresses your concerns? If not, I'll probably merge this PR as-is.

I already gave my suggestion in my previous comments: basically, we can embed the backend as a substring of device_type, i.e., cuda:nccl or cuda:fake; this aligns with the ProcessGroup dispatchable-backend API style. When that's specified, we would by default initialize all process groups with that backend, and if pg_override overrides one dimension's backend, only that dimension would be overridden; all other dimensions would still use the global backend the user specified, instead of the default one.
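Roughly, the proposed combination would look like this (tp_opts stands in for some backend-specific options object; this illustrates the proposal, not the merged behavior):

from torch.distributed.device_mesh import init_device_mesh

tp_opts = ...  # placeholder for a C10dBackend.Options subclass instance

mesh = init_device_mesh(
    "cuda:fake",                   # proposed: global backend parsed from device_type
    (2, 2, 2),
    mesh_dim_names=("dp", "tp", "cp"),
    pg_override={"tp": tp_opts},   # only "tp" is overridden; other dims keep "fake"
)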

[ghstack-poisoned]
lw added a commit that referenced this pull request Aug 5, 2025
@lw (Contributor Author) commented Aug 5, 2025

Thanks everyone!

@lw (Contributor Author) commented Aug 5, 2025

@pytorchbot merge

@pytorch-bot bot added the ciflow/trunk label (Trigger trunk jobs on your pull request) on Aug 5, 2025
@pytorchmergebot (Collaborator) commented:

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot (Collaborator) commented:

Merge failed

Reason: 1 jobs have failed, first few of them are: Check mergeability of ghstack PR / ghstack-mergeability-check

Details for Dev Infra team: raised by workflow job.

@lw (Contributor Author) commented Aug 5, 2025

@pytorchbot merge -f "Failures due to CI SEV #159825"

@pytorchmergebot (Collaborator) commented:

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@github-actions github-actions bot deleted the gh/lw/1/head branch September 5, 2025 02:09
fduwjj added a commit that referenced this pull request Oct 8, 2025
We allow passing in PG options via #159371, and we did a cleanup of Meta-internal usage of `_set_mesh_dim_group_options`. Since this is a private API, we don't have any BC guarantee, so we remove it directly so that people use the new behavior from now on.

Also, since we now allow passing PG options in both the DeviceMesh constructor and the flatten API, we want to get rid of the global PG option override variable.

cc H-Huang awgu wanchaol fegin wz337 wconstab d4l3k pragupta msaroufim dcci

[ghstack-poisoned]
fduwjj added a commit that referenced this pull request Oct 8, 2025
…group_options API"

fduwjj added a commit that referenced this pull request Oct 8, 2025
pytorchmergebot pushed a commit that referenced this pull request Oct 8, 2025
Pull Request resolved: #164750
Approved by: https://github.com/lw, https://github.com/fegin

Labels: ciflow/trunk, Merged, oncall: distributed, release notes: DeviceMesh
