
Conversation

@lw (Contributor) commented Jul 29, 2025

[ghstack-poisoned]
@pytorch-bot bot added the oncall: distributed label (Add this issue/PR to distributed oncall triage queue) on Jul 29, 2025
pytorch-bot bot commented Jul 29, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/159371

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

❌ 2 New Failures, 1 Unrelated Failure

As of commit 9927ca7 with merge base 908c5cc:

NEW FAILURES - The following jobs have failed:

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

  • pull / linux-jammy-py3_9-clang9-xla / test (xla, 1, 1, linux.12xlarge, unstable) (gh) (#158876)
    /var/lib/jenkins/workspace/xla/torch_xla/csrc/runtime/BUILD:476:14: Compiling torch_xla/csrc/runtime/xla_util_test.cpp failed: (Exit 1): gcc failed: error executing CppCompile command (from target //torch_xla/csrc/runtime:xla_util_test) /usr/bin/gcc -U_FORTIFY_SOURCE -fstack-protector -Wall -Wunused-but-set-parameter -Wno-free-nonheap-object -fno-omit-frame-pointer -g0 -O2 '-D_FORTIFY_SOURCE=1' -DNDEBUG -ffunction-sections ... (remaining 229 arguments skipped)

This comment was automatically generated by Dr. CI and updates every 15 minutes.

lw added a commit that referenced this pull request Jul 29, 2025
[ghstack-poisoned]
lw added a commit that referenced this pull request Jul 29, 2025
@wconstab (Contributor) left a comment

LGTM! Though it would be nice to also test that the options get passed correctly. (Or maybe this is covered by existing tests?)

@fduwjj (Contributor) left a comment

Thanks for the fix; I left some comments. Stamping to unblock.

# Inherit from the parent group if no options are specified for the group.
if dim in _mesh_resources.mesh_dim_group_options:
    if pg_backend_and_options[dim] != (None, None):
        raise RuntimeError(
Contributor:
Hmm, maybe we can let pg_backend_and_options override what the mesh has already set? No?

Contributor Author:
I didn't want to worry right now about which of the two gets priority if they're both specified, so I decided to throw in such cases. Do you have an argument for why the argument should take precedence over the global?

Contributor:
Well, I think the argument is that if the user passed in an option explicitly, that means it should override the existing one?

Comment on lines 1068 to 1069
if mesh_dim_name in pg_backend_and_options:
    if mesh_dim_idx in pg_backend_and_options:
Contributor:
nit: you might want to consolidate these two into one if?

Contributor Author:
Either we do (like now):

if a:
    if b:
        raise
    do()

or (as you suggest)

if a and b:
    raise
if a:
    do()

I prefer the first one since it doesn't repeat a twice.

mesh_shape: tuple[int, ...],
*,
mesh_dim_names: Optional[tuple[str, ...]] = None,
pg_backend_and_options: Optional[
Contributor:
Can you kindly add a comment at L1040 saying that the key here can be a dim or a dim name?

Comment on lines 1078 to 1089
if pg_backend_and_options is not None:
    pg_backend_and_options_tuple = tuple(
        pg_backend_and_options.pop(i, (None, None))
        for i in range(len(mesh_shape))
    )
    if pg_backend_and_options:
        raise RuntimeError(
            f"Found invalid keys in pg_backend_and_options: got {list(pg_backend_and_options.keys())}, "
            f"expected integers in range [0, {len(mesh_shape)}) or one of {mesh_dim_names or []}"
        )
else:
    pg_backend_and_options_tuple = None
Contributor:
any chance we can move this logic into the previous for-loop? Looks like we are looping the mesh twice?

Contributor Author:
The previous for-loop is only entered if mesh_dim_names is not None. So we could merge this into a single loop, but that loop would then have to do redundant checks on mesh_dim_names, which would be redundant in a different way.

@wanchaol (Collaborator) left a comment

Thanks for adding the feature! I have some suggestions on how the UX should look; requesting changes to align on the UX.

My opinion is that we should separate the backend selection and the PG options instead of blending them together; this would make the argument more concise. Here is the proposed UX:

# Switching between different backends is controlled by the device_type argument; this is aligned with how process group backend selection works.
init_device_mesh("cuda:nccl", ...)
# Backend options can be passed in via something like:
init_device_mesh("cuda:nccl", mesh_shape, mesh_dim_names=("dp", "tp"), mesh_dim_options={"dp": nccl_pg_options})

Basically, mesh_dim_options should be of type Dict[str, C10dBackend.Options]. I think we could support only str keys, but accepting int as well also sounds good.

mesh_shape: tuple[int, ...],
*,
mesh_dim_names: Optional[tuple[str, ...]] = None,
pg_backend_and_options: Optional[
Collaborator:
hmmm I would prefer to:

  1. Separate the PG backend and the PG options; these should be controlled separately, IMO.
  2. Name this mesh_dim_options instead.

Contributor Author:
I find it useless to prepend mesh_dim_: this is the function to create a device mesh, it's quite implicit!

On the other hand, I want to explicitly mention pg_ or group_ or something to clarify it's not the options of the mesh dim itself, but of the underlying ProcessGroup.

Collaborator:
I find it useless to prepend mesh_dim_: this is the function to create a device mesh, it's quite implicit!

Sure, that makes sense! I was originally thinking to align it with mesh_dim_names, but I agree it's redundant. Shall we simply call it options? Maybe we should later change mesh_dim_names to dim_names or simply names.

On the other hand, I want to explicitly mention pg_ or group_ or something to clarify it's not the options of the mesh dim itself, but of the underlying ProcessGroup.

I wonder what you mean by the options of the mesh dim itself? I thought there's nothing that can be configured (or has options) on a mesh dim except its underlying communicator, so it would be quite explicit to the user? I am OK with group_options, but if there's no confusion about options, I think we should just go with options.

@lw (Contributor Author) commented Jul 30, 2025

I'm not sure I agree with the comments on UX.

  • True, this might add a burden if one wants to specify only the backend type or only the options, but a given options object only works for one certain backend type (e.g., ProcessGroupNCCL.Options(...) requires "nccl" as backend type), thus it doesn't seem that burdensome to ask the user to be explicit about it. Similarly, if one wants to specify only the backend type, adding None as a second argument doesn't seem too hard.
  • However, your proposal makes it much more verbose and redundant if one wants to give both a backend type and an options object. Currently it would be pg_backend_and_options={"tp": ("nccl", opts)}; with your proposal it would be pg_backend={"tp": "nccl"}, pg_options={"tp": opts}, thus increasing the risk of errors if the two get out of sync.

What I can propose is that we allow the values of the dict to also be simple strs or simple objects of (a subtype of) C10dBackend.Options, and if so we automatically turn them into a tuple by autofilling the other half with None.
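A minimal sketch of that normalization (the helper name is hypothetical, and stricter type checking is omitted):

# Hypothetical helper: accept a (backend, options) tuple, a bare backend
# string, or a bare options object, and normalize to a (backend, options) pair.
def _normalize_backend_and_options(spec):
    if isinstance(spec, tuple):
        # Already a (backend, options) pair; pass it through unchanged.
        return spec
    if isinstance(spec, str):
        # Only the backend type was given; leave the options half as None.
        return spec, None
    # Otherwise assume it is a C10dBackend.Options object; leave the backend as None.
    return None, spec


# All of these normalize to a (backend, options) pair.
assert _normalize_backend_and_options(("nccl", None)) == ("nccl", None)
assert _normalize_backend_and_options("nccl") == ("nccl", None)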

@wanchaol (Collaborator) commented Jul 30, 2025

  • True, this might add a burden if one wants to specify only the backend type or only the options, but a given options object only works for one certain backend type (e.g., ProcessGroupNCCL.Options(...) requires "nccl" as backend type), thus it doesn't seem that burdensome to ask the user to be explicit about it. Similarly, if one wants to specify only the backend type, adding None as a second argument doesn't seem too hard.

I think the problem with blending the concepts of backend and PG options is that many times users actually only want to specify the backend, or only want to specify the options; there's very rarely a case where users want to specify them together. So IMO we should try to make the common-case experience better: what if the user just passes PG options and doesn't want to specify the backend? And when specifying options for a multi-dim mesh, the backend needs to be specified again and again.

  • However, your proposal makes it much more verbose and redundant if one wants to give both a backend type and an options object. Currently it would be pg_backend_and_options={"tp": ("nccl", opts)}; with your proposal it would be pg_backend={"tp": "nccl"}, pg_options={"tp": opts}, thus increasing the risk of errors if the two get out of sync.

I wonder how it makes things much more verbose if one wants to give both a backend type and options? My proposal is to not add a separate pg_backend argument, but rather to encode it directly in device_type; this is similar to how the process group backend specification is done. I think it is actually simpler than the current UX. The proposed UX would be something like this:

init_device_mesh("cuda:nccl", mesh_shape, options={"tp": opts})

In current way it would be:

init_device_mesh("cuda", mesh_shape, pg_backend_and_options={"tp": ("nccl", opts)})

I think it becomes even simpler if the user wants to specify PG options for multiple dimensions of the device mesh; the proposed UX would be:

init_device_mesh(
    "cuda:nccl",
    mesh_shape,
    options={"tp": tp_opts, "dp": dp_opts, "cp": cp_opts}
)

In the current way it would be:

init_device_mesh(
    "cuda",
    mesh_shape,
    pg_backend_and_options={"tp": ("nccl", tp_opts), "dp": ("nccl", dp_opts), "cp": ("nccl", cp_opts)}
)

It seems to me that repeating the "nccl" backend, along with the nested dict, looks cumbersome. There should not be an out-of-sync issue, as the backend is parsed from device_type and we can easily guard against it.

@wconstab (Contributor) commented:

Here's a clarifying question: how important is it to support a different backend type per dimension? If it is important, then this version, init_device_mesh("cuda:nccl", ...), is limiting, because it implies one backend type shared by all the mesh dims. I'm not sure whether @lw wants this, but I think it's getting more common (e.g., torchft) to want a special type of backend for just one dimension.

@wanchaol (Collaborator) commented:

How important is it to support a different backend type per dimension? If it is important, then this version, init_device_mesh("cuda:nccl", ...), is limiting, because it implies one backend type shared by all the mesh dims. I'm not sure whether @lw wants this, but I think it's getting more common (e.g., torchft) to want a special type of backend for just one dimension.

@wconstab I am not sure exactly how torchft wants to change the backend for each dimension, and I think it's not even clear to me that having a different backend on each dimension makes sense: the way to create a subgroup is to inherit the backend of the world group (and to rely on the backend's way of splitting the group, i.e., ncclSplit), and many DeviceMesh operations today (flatten, the upcoming split) require the same backend, otherwise the operation is very much undefined.

@wconstab (Contributor) commented:

I think we'd better not assume the world group is materialized. It may be important to have it for some jobs, but it is also too expensive to create a world group in larger-scale jobs. So I prefer the direction of decoupling the top-level world from the first layer of dims, and in this case I think it can make sense to allow separate dims to use different backends.

But I also agree the semantics of split should have a defined meaning corresponding to a PG split.

@wanchaol (Collaborator) commented:

I think we'd better not assume the world group is materialized. It may be important to have it for some jobs, but it is also too expensive to create a world group in larger-scale jobs. So I prefer the direction of decoupling the top-level world from the first layer of dims, and in this case I think it can make sense to allow separate dims to use different backends.

Yeah sure, I am not assuming the world group is materialized, and I think it's probably not a good idea to even materialize the world PG at all for a multi-dim mesh. But even if we don't materialize the world PG, we should avoid undefined behaviors as much as possible (i.e., what does flattening two dimensions with completely different backends even mean?). Once a real communicator is formed, all the other connections in the same device mesh should use the same backend; otherwise it's not really a "device mesh", since the devices are not connected together via the same type of network communicator.

If someone really wants a separate backend for certain ranks/dims, IMO they can either manage it as a separate process group by hand, or manage a separate DeviceMesh created directly via the DeviceMesh constructor (instead of init_device_mesh).

@lw (Contributor Author) commented Jul 31, 2025

  • There are use cases where different dims should use different backends (e.g., intra- and inter-node)
  • The root PG might not be "materialized", but we must count on it having a backend type and some options that we can copy

I'm gonna make a call in order to close this discussion, which is to proceed as follows: I'll rename the argument to pg_override and make it accept a (type, opts) tuple, but also just the type (str) or just the options (C10dBackend.Options). I believe this should make everyone happy.
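For illustration, calls under that scheme could look roughly like this (the dim names, the NCCL options object, and the exact key/value combinations are made up; a CUDA/NCCL build is assumed):

import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh

opts = dist.ProcessGroupNCCL.Options()  # any C10dBackend.Options subclass works here
opts.is_high_priority_stream = True

mesh = init_device_mesh(
    "cuda",
    (2, 2, 2),
    mesh_dim_names=("dp", "tp", "cp"),
    pg_override={
        "dp": "nccl",          # backend type only
        "tp": opts,            # options only, backend stays the default
        "cp": ("nccl", opts),  # both backend type and options
    },
)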

[ghstack-poisoned]
lw added a commit that referenced this pull request Jul 31, 2025
Comment on lines +23 to +27
struct Options : Backend::Options {
  explicit Options() : Backend::Options("fake") {}

  int fake_option = 0;
};
Contributor Author:
Needed to define options for the fake PG in order to test the options-only override

Contributor:
I guess this is fine. Maybe later we can replace it with a real option.


def init_pg(self, eager_init) -> None:
if "nccl" in self.backend and torch.cuda.device_count() < self.world_size:
def init_pg(self, eager_init, backend: Optional[str] = None) -> None:
Contributor Author:
Had to modify this (and below in this file) in order to allow using the "fake" backend as the root PG



def _create_fake_pg(prefix_store, rank, world_size, timeout):
def _create_fake_pg(common_opts, backend_opts):
Contributor Author:
Switched to using the "extended API" as otherwise the options wouldn't get passed to the FakePG constructor
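For reference, a registration using the extended API looks roughly like this; the backend name, the attribute names on common_opts, and the device list are assumptions for illustration, not the PR's actual test code:

import torch.distributed as dist

def _create_my_fake_pg(common_opts, backend_opts):
    # With extended_api=True the creator receives a bundle of common options
    # plus the backend-specific options object, which is what lets per-dim
    # Options reach the backend constructor. Attribute names are an assumption.
    store = common_opts.store
    rank = common_opts.group_rank
    world_size = common_opts.group_size
    raise NotImplementedError("illustrative sketch only")

dist.Backend.register_backend(
    "my_fake",          # hypothetical backend name
    _create_my_fake_pg,
    extended_api=True,  # opt into the (common_opts, backend_opts) signature
    devices=["cpu"],
)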

Contributor:
I am wondering if this will cause any BC breakage for some tests?

[ghstack-poisoned]
@lw (Contributor Author) commented Jul 31, 2025

I've added support for the pg override to _flatten as well
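A rough sketch of that usage; the pg_override keyword on _flatten mirrors the init_device_mesh argument and is an assumption here, as is the specific options object:

import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh

mesh = init_device_mesh("cuda", (2, 2, 2), mesh_dim_names=("dp", "cp", "tp"))

# Flatten "dp" and "cp" into one dim, overriding the backend/options used for
# the flattened group's ProcessGroup.
flat_opts = dist.ProcessGroupNCCL.Options()
dp_cp_mesh = mesh["dp", "cp"]._flatten("dp_cp", pg_override=("nccl", flat_opts))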

lw added a commit that referenced this pull request Jul 31, 2025
self.device_type,
(2, 2, 2),
mesh_dim_names=("dp", "tp", "cp"),
pg_override={0: "fake", 2: ("fake", opts)},
Collaborator:
So what if a user creates a 4D or 5D device mesh, and they want to use their custom backend (e.g., even the fake backend tested here), but they don't need to customize the options per device mesh dimension? This is pretty common for non-CUDA backend vendors. With the current API the user would need to do something like this:

mesh = init_device_mesh(
    self.device_type,
    (2, 2, 2, 2, 2),
    mesh_dim_names=("pp", "dp", "tp", "cp", "ep"),
    pg_override={0: "fake", 1: "fake", 2: "fake", 3: "fake", 4: "fake"},
)

As a user, I feel this is still very cumbersome; but if I don't do this, some of the mesh dimensions would just silently use the "default" backend for the current device type.

@wanchaol (Collaborator) commented:

I'm gonna make a call in order to close this discussion, which is to proceed as follows: I'll rename the argument to pg_override and make it accept a (type, opts) tuple, but also just the type (str) or just the options (C10dBackend.Options). I believe this should make everyone happy.

@lw Can we start by exposing this argument as a private API (both in init_device_mesh and flatten)? I don't want to block any of your work, and I feel it would be good to listen to user feedback about the UX first (just like we did for the _flatten API); if the community feels it's good, then let's make it public.

@lw (Contributor Author) commented Aug 1, 2025

What if a user creates a 4D or 5D device mesh, and they want to use their custom backend (e.g., even the fake backend tested here), but they don't need to customize the options per device mesh dimension?

I don't understand whether this is a use case you're hitting in practice or whether you're imagining corner cases where the API would be awkward to use. The new API addresses what you asked for earlier (one can provide just the options if one so chooses), but now you're evoking a case where one does want to change the backend?

In order to pragmatically address your example use case, I think the user should just use fake as their primary backend when calling init_process_group, and then they wouldn't have to pass any override. If for whatever reason they want the device mesh to use a different root/default backend than c10d's, we could envision adding a new optional backend=... argument to init_device_mesh, which acts as a default for all dimensions. I suggest we only do that once we have a compelling use case.
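Concretely, something along these lines (the backend name and sizes are made up for illustration, the backend is assumed to already be registered for the device type, and the usual env:// rendezvous variables are assumed to be set):

import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh

# Make the custom backend the default for the whole job; every mesh dimension
# then picks it up automatically, with no per-dimension override needed.
dist.init_process_group(backend="my_fake", rank=0, world_size=8)
mesh = init_device_mesh("cpu", (2, 4), mesh_dim_names=("dp", "tp"))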

Note also that no API in PyTorch is truly private. No matter whether there's a leading underscore, if that's the only way to achieve what people need, they will use it in their downstream project and PyTorch won't be able to remove it or break it. Case in point: the _flatten method is supposedly private but we're using it in our codebase and we'd be quite annoyed if it changed.

Do you have a concrete proposal that addresses your concerns? If not, I'll probably merge this PR as-is.

@wanchaol (Collaborator) commented Aug 1, 2025

I don't understand whether this is a use case you're hitting in practice or whether you're imagining corner cases where the API would be awkward to use. The new API addresses what you asked for earlier (one can provide just the options if one so chooses), but now you're evoking a case where one does want to change the backend?

@lw To be clear, my original comment is about both: "users actually only want to specify the backend, or only want to specify the options". The new API only addresses one of the asks, so I am not sure why you think I'm evoking a new case; I am just discussing the existing ones. The case where the user only specifies the backend is pretty common, e.g., for HCCL users, or for any users who implemented a custom process group and registered it as a backend (we have one, so it's something we would hit with the current API).

Note also that no API in PyTorch is truly private. No matter whether there's a leading underscore, if that's the only way to achieve what people need, they will use it in their downstream project and PyTorch won't be able to remove it or break it. Case in point: the _flatten method is supposedly private but we're using it in our codebase and we'd be quite annoyed if it changed.

The private API is mostly for developers to have the freedom to adapt/change the API depending on new use cases/feedback before making it stable; otherwise every BC-breaking change needs to go through 3 release cycles. It's also a signal to users that PyTorch can remove/break the API, so use at your own risk. If we didn't have private APIs then we wouldn't need prototype features; everything could be public and developed/used in prod, and I don't think PyTorch wants to be that way.

Do you have a concrete proposal that addresses your concerns? If not, I'll probably merge this PR as-is.

I already gave my suggestion in my previous comments: basically, we can embed the backend as a substring of device_type, i.e., cuda:nccl or cuda:fake; this aligns with the ProcessGroup dispatchable-backend API style. When that's specified, we would by default initialize all process groups with that backend, and if pg_override overrides one dimension's backend, only that dimension would be overridden; all other dimensions would still use the global backend the user specified, instead of the default one.
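Roughly, the proposed combination would look like this (tp_opts stands in for some backend-specific options object; this illustrates the proposal, not the merged behavior):

from torch.distributed.device_mesh import init_device_mesh

tp_opts = ...  # placeholder for a C10dBackend.Options subclass instance

mesh = init_device_mesh(
    "cuda:fake",                   # proposed: global backend parsed from device_type
    (2, 2, 2),
    mesh_dim_names=("dp", "tp", "cp"),
    pg_override={"tp": tp_opts},   # only "tp" is overridden; other dims keep "fake"
)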

[ghstack-poisoned]
lw added a commit that referenced this pull request Aug 5, 2025
@lw (Contributor Author) commented Aug 5, 2025

Thanks everyone!

@lw (Contributor Author) commented Aug 5, 2025

@pytorchbot merge

@pytorch-bot bot added the ciflow/trunk label (Trigger trunk jobs on your pull request) on Aug 5, 2025
@pytorchmergebot (Collaborator) commented:

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot (Collaborator) commented:

Merge failed

Reason: 1 jobs have failed, first few of them are: Check mergeability of ghstack PR / ghstack-mergeability-check

Details for Dev Infra team: raised by workflow job.

@lw (Contributor Author) commented Aug 5, 2025

@pytorchbot merge -f "Failures due to CI SEV #159825"

@pytorchmergebot (Collaborator) commented:

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@github-actions github-actions bot deleted the gh/lw/1/head branch September 5, 2025 02:09
fduwjj added a commit that referenced this pull request Oct 8, 2025
We allow passing in PG options via #159371, and we did a cleanup of Meta-internal usage of `_set_mesh_dim_group_options`. Since this is a private API, we don't have any BC guarantee, so we remove it directly so that people use the new behavior from now on.

Also, since we now allow passing PG options in both the DeviceMesh constructor and the flatten API, we want to get rid of the global PG option override variable.

cc H-Huang awgu wanchaol fegin wz337 wconstab d4l3k pragupta msaroufim dcci

[ghstack-poisoned]
fduwjj added a commit that referenced this pull request Oct 8, 2025
…group_options API"

fduwjj added a commit that referenced this pull request Oct 8, 2025
pytorchmergebot pushed a commit that referenced this pull request Oct 8, 2025
Pull Request resolved: #164750
Approved by: https://github.com/lw, https://github.com/fegin

Labels: ciflow/trunk, Merged, oncall: distributed, release notes: DeviceMesh
