Skip to content

Conversation

fduwjj
Copy link
Contributor

@fduwjj fduwjj commented Jul 29, 2025

Summary:
We found that we don't really set group_name inside group_split correctly, because we are setting group_name to deviceTypeToBackend_ which is set after setBackend. Same thing as group_desc. I added more unit tests for it.

We need to setGroupName correctly, otherwise, this will break DeviceMesh use case when split_group is used in DeviceMesh

Also ncclx needs to be aware of that its Option is a subclass of BackendOption

Test Plan:
CI

Rollback Plan:

Differential Revision: D79201132

cc @H-Huang @awgu @wanchaol @fegin @wz337 @wconstab @d4l3k @pragupta

@pytorch-bot pytorch-bot bot added the oncall: distributed Add this issue/PR to distributed oncall triage queue label Jul 29, 2025
Copy link

pytorch-bot bot commented Jul 29, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/159429

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (3 Unrelated Failures)

As of commit 0520b21 with merge base 1465757 (image):

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the release notes: distributed (c10d) release notes category label Jul 29, 2025
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D79201132

@fduwjj fduwjj force-pushed the export-D79201132 branch from b741186 to f9789c0 Compare July 30, 2025 04:45
fduwjj added a commit to fduwjj/pytorch that referenced this pull request Jul 30, 2025
…remote_group` (pytorch#159429)

Summary:

We found that we don't really set group_name inside group_split correctly, because we are setting group_name to `deviceTypeToBackend_` which is set after `setBackend`. Same thing as group_desc. I added more unit tests for it.

We need to setGroupName correctly, otherwise, this will break DeviceMesh use case when `split_group` is used in DeviceMesh

Also ncclx needs to be aware of that its Option is a subclass of BackendOption

Test Plan:
CI

```
buck test -c hpc_comms.use_ncclx=stable comms/ncclx/pg/tests:test_c10d_ncclx_persistent -- DeviceMeshTest
```


```
buck test -c hpc_comms.use_ncclx=stable comms/ncclx/pg/tests:test_c10d_ncclx -- test_group_split_and_merge
```

Rollback Plan:

Differential Revision: D79201132
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D79201132

@fduwjj fduwjj requested a review from xunnanxu July 30, 2025 04:49
@fduwjj fduwjj force-pushed the export-D79201132 branch from f9789c0 to 7d8df10 Compare July 30, 2025 05:01
fduwjj added a commit to fduwjj/pytorch that referenced this pull request Jul 30, 2025
…remote_group` (pytorch#159429)

Summary:

We found that we don't really set group_name inside group_split correctly, because we are setting group_name to `deviceTypeToBackend_` which is set after `setBackend`. Same thing as group_desc. I added more unit tests for it.

We need to setGroupName correctly, otherwise, this will break DeviceMesh use case when `split_group` is used in DeviceMesh

Also ncclx needs to be aware of that its Option is a subclass of BackendOption

Test Plan:
CI

```
buck test -c hpc_comms.use_ncclx=stable comms/ncclx/pg/tests:test_c10d_ncclx_persistent -- DeviceMeshTest
```


```
buck test -c hpc_comms.use_ncclx=stable comms/ncclx/pg/tests:test_c10d_ncclx -- test_group_split_and_merge
```

Rollback Plan:

Differential Revision: D79201132
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D79201132

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Jul 30, 2025
@xunnanxu
Copy link
Contributor

LG. Thanks for the fix!

@fduwjj fduwjj force-pushed the export-D79201132 branch from 7d8df10 to 8f5c039 Compare July 30, 2025 14:04
fduwjj added a commit to fduwjj/pytorch that referenced this pull request Jul 30, 2025
…remote_group` (pytorch#159429)

Summary:

We found that we don't really set group_name inside group_split correctly, because we are setting group_name to `deviceTypeToBackend_` which is set after `setBackend`. Same thing as group_desc. I added more unit tests for it.

We need to setGroupName correctly, otherwise, this will break DeviceMesh use case when `split_group` is used in DeviceMesh

Also ncclx needs to be aware of that its Option is a subclass of BackendOption

Test Plan: CI

Reviewed By: xunnanxu

Differential Revision: D79201132
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D79201132

…remote_group` (pytorch#159429)

Summary:

We found that we don't really set group_name inside group_split correctly, because we are setting group_name to `deviceTypeToBackend_` which is set after `setBackend`. Same thing as group_desc. I added more unit tests for it.

We need to setGroupName correctly, otherwise, this will break DeviceMesh use case when `split_group` is used in DeviceMesh

Also ncclx needs to be aware of that its Option is a subclass of BackendOption

Test Plan: CI

Reviewed By: xunnanxu

Differential Revision: D79201132
@fduwjj fduwjj force-pushed the export-D79201132 branch from 8f5c039 to 0520b21 Compare July 30, 2025 14:21
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D79201132

@fduwjj fduwjj changed the title [c10d] Fix group_split [c10d] Fix setGroupName and setGroupDesc in group_split and merge_remote_group Jul 30, 2025
@fduwjj
Copy link
Contributor Author

fduwjj commented Jul 30, 2025

The inductor test failed with "RuntimeError: Expected to find "buf7 = empty" but did not find it" which is not related to this PR. We don't interfere with how Triton kernel is generated and split_group is not used there as well.

@fduwjj
Copy link
Contributor Author

fduwjj commented Jul 30, 2025

@pytorchbot merge -f "failure test not related"

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

yangw-dev pushed a commit that referenced this pull request Aug 1, 2025
…remote_group` (#159429)

Summary:
We found that we don't really set group_name inside group_split correctly, because we are setting group_name to `deviceTypeToBackend_` which is set after `setBackend`. Same thing as group_desc. I added more unit tests for it.

We need to setGroupName correctly, otherwise, this will break DeviceMesh use case when split_group is used in DeviceMesh

Also ncclx needs to be aware of that its Option is a subclass of BackendOption

Test Plan:
CI

Rollback Plan:

Differential Revision: D79201132

Pull Request resolved: #159429
Approved by: https://github.com/xunnanxu
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ciflow/trunk Trigger trunk jobs on your pull request fb-exported Merged oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (c10d) release notes category

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants