Skip to content

Conversation

yifuwang
Copy link
Collaborator

@yifuwang yifuwang commented Apr 10, 2024

Stack from ghstack (oldest at bottom):

Summary

After this PR, the functional collective Python APIs will stop honoring TORCH_DISABLE_NATIVE_FUNCOL and only use native funcol ops. Specifically, this PR:

  • Removed use_native_funcol().
  • Removed the code path in the Python APIs when use_native_funcol() is False.
  • Changed the CI tests that runs on both native funcol and legacy funcol through the Python API to only run with native funcol.

Test Changes

test_functional_api.py

  • Removed the tests where only one of output_split_sizes or input_split_sizes is specified. This behavior is unreliable has been removed from the native funcol.
  • Removed TestWaitiness which tests an implementation detail of the legacy funcol. We have equivalent tests for native funcol in test/distributed/test_c10d_functional_native.py
    assert not output.completed
    assert output.eq(expect).all()
    assert output.completed

test/distributed/_tensor/test_dtensor.py
test/distributed/_tensor/test_dtensor_compile.py
test/distributed/test_device_mesh.py
test/distributed/_tensor/experimental/test_tp_transform.py
test/distributed/_tensor/test_matrix_ops.py
test/distributed/test_inductor_collectives.py

  • All these tests were double running with both native funcol and legacy funcol. Changed to only run with native funcol.

test/distributed/test_c10d_functional_native.py

  • Removed the run_with_native_funcol decorators.

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang

WIP
[ghstack-poisoned]
@pytorch-bot
Copy link

pytorch-bot bot commented Apr 10, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/123777

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 655f2ba with merge base 585cd11 (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the oncall: distributed Add this issue/PR to distributed oncall triage queue label Apr 10, 2024
@yifuwang yifuwang requested review from wanchaol and wconstab April 10, 2024 22:12
@yifuwang yifuwang marked this pull request as ready for review April 10, 2024 22:12
@yifuwang yifuwang marked this pull request as draft April 10, 2024 22:13
cc mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse H-Huang kwen2501 awgu penguinwu fegin XilunWu wanchaol fduwjj wz337 tianyu-l wconstab yf225 chauhang

[ghstack-poisoned]
cc mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse H-Huang kwen2501 awgu penguinwu fegin XilunWu wanchaol fduwjj wz337 tianyu-l wconstab yf225 chauhang

[ghstack-poisoned]
yifuwang pushed a commit that referenced this pull request Apr 10, 2024
WIP
ghstack-source-id: 0653689
Pull Request resolved: #123777
@yifuwang yifuwang changed the title WIP [functional collective] change the Python API to only use the native funcol ops Apr 11, 2024
@yifuwang yifuwang changed the title [functional collective] change the Python API to only use the native funcol ops [functional collective] change the Python APIs to only use the native funcol ops Apr 11, 2024
@yifuwang yifuwang marked this pull request as ready for review April 11, 2024 21:14
@yifuwang yifuwang requested review from wanchaol and wconstab April 11, 2024 21:14
… the native funcol ops"


## Summary

After this PR, the functional collective Python APIs will stop honoring `TORCH_DISABLE_NATIVE_FUNCOL` and only use native funcol ops. Specifically, this PR:
- Removed `use_native_funcol()`.
- Removed the code path in the Python APIs when `use_native_funcol()` is `False`.
- Changed the CI tests that runs on both native funcol and legacy funcol through the Python API to only run with native funcol.

## Test Changes

`test_functional_api.py`
- Removed the tests where only one of output_split_sizes or input_split_sizes is specified. This behavior is unreliable has been removed from the native funcol.
- Removed `TestWaitiness` which tests an implementation detail of the legacy funcol. We have equivalent tests for native funcol in `test/distributed/test_c10d_functional_native.py` https://github.com/pytorch/pytorch/blob/b7fac76fc259394136bc77b3e39d5705919e5c4c/test/distributed/test_c10d_functional_native.py#L114-L116

`test/distributed/_tensor/test_dtensor.py`
`test/distributed/_tensor/test_dtensor_compile.py`
`test/distributed/test_device_mesh.py`
`test/distributed/_tensor/experimental/test_tp_transform.py`
`test/distributed/_tensor/test_matrix_ops.py`
`test/distributed/test_inductor_collectives.py`
- All these tests were double running with both native funcol and legacy funcol. Changed to only run with native funcol.

`test/distributed/test_c10d_functional_native.py`
- Removed the `run_with_native_funcol` decorators.

cc mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse H-Huang kwen2501 awgu penguinwu fegin XilunWu wanchaol fduwjj wz337 tianyu-l wconstab yf225 chauhang

[ghstack-poisoned]
@yifuwang yifuwang requested a review from yf225 April 11, 2024 21:15
yifuwang pushed a commit that referenced this pull request Apr 11, 2024
… funcol ops

ghstack-source-id: 2c2fe59
Pull Request resolved: #123777
Copy link
Collaborator

@wanchaol wanchaol left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome to see this happens!

@yifuwang
Copy link
Collaborator Author

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Apr 12, 2024
@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@pytorchmergebot
Copy link
Collaborator

Merge failed

Reason: Command git -C /home/runner/work/pytorch/pytorch cherry-pick -x ede1ca6ebf63c15b052d1b8fb5efb1359c553e41 returned non-zero exit code 1

Auto-merging test/distributed/test_functional_api.py
CONFLICT (content): Merge conflict in test/distributed/test_functional_api.py
Auto-merging torch/distributed/_functional_collectives.py
error: could not apply ede1ca6ebf6... [functional collective] change the Python APIs to only use the native funcol ops
hint: After resolving the conflicts, mark them with
hint: "git add/rm <pathspec>", then run
hint: "git cherry-pick --continue".
hint: You can instead skip this commit with "git cherry-pick --skip".
hint: To abort and get back to the state before "git cherry-pick",
hint: run "git cherry-pick --abort".
Details for Dev Infra team Raised by workflow job

… the native funcol ops"


## Summary

After this PR, the functional collective Python APIs will stop honoring `TORCH_DISABLE_NATIVE_FUNCOL` and only use native funcol ops. Specifically, this PR:
- Removed `use_native_funcol()`.
- Removed the code path in the Python APIs when `use_native_funcol()` is `False`.
- Changed the CI tests that runs on both native funcol and legacy funcol through the Python API to only run with native funcol.

## Test Changes

`test_functional_api.py`
- Removed the tests where only one of output_split_sizes or input_split_sizes is specified. This behavior is unreliable has been removed from the native funcol.
- Removed `TestWaitiness` which tests an implementation detail of the legacy funcol. We have equivalent tests for native funcol in `test/distributed/test_c10d_functional_native.py` https://github.com/pytorch/pytorch/blob/b7fac76fc259394136bc77b3e39d5705919e5c4c/test/distributed/test_c10d_functional_native.py#L114-L116

`test/distributed/_tensor/test_dtensor.py`
`test/distributed/_tensor/test_dtensor_compile.py`
`test/distributed/test_device_mesh.py`
`test/distributed/_tensor/experimental/test_tp_transform.py`
`test/distributed/_tensor/test_matrix_ops.py`
`test/distributed/test_inductor_collectives.py`
- All these tests were double running with both native funcol and legacy funcol. Changed to only run with native funcol.

`test/distributed/test_c10d_functional_native.py`
- Removed the `run_with_native_funcol` decorators.

cc mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse H-Huang kwen2501 awgu penguinwu fegin XilunWu wanchaol fduwjj wz337 tianyu-l wconstab yf225 chauhang

[ghstack-poisoned]
yifuwang pushed a commit that referenced this pull request Apr 13, 2024
… funcol ops

ghstack-source-id: df89ecc
Pull Request resolved: #123777
@yifuwang yifuwang added the topic: not user facing topic category label Apr 13, 2024
@yifuwang
Copy link
Collaborator Author

@pytorchbot merge

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

sanketpurandare pushed a commit to sanketpurandare/pytorch that referenced this pull request Apr 22, 2024
… funcol ops (pytorch#123777)

## Summary

After this PR, the functional collective Python APIs will stop honoring `TORCH_DISABLE_NATIVE_FUNCOL` and only use native funcol ops. Specifically, this PR:
- Removed `use_native_funcol()`.
- Removed the code path in the Python APIs when `use_native_funcol()` is `False`.
- Changed the CI tests that runs on both native funcol and legacy funcol through the Python API to only run with native funcol.

## Test Changes

`test_functional_api.py`
- Removed the tests where only one of output_split_sizes or input_split_sizes is specified. This behavior is unreliable has been removed from the native funcol.
- Removed `TestWaitiness` which tests an implementation detail of the legacy funcol. We have equivalent tests for native funcol in `test/distributed/test_c10d_functional_native.py` https://github.com/pytorch/pytorch/blob/b7fac76fc259394136bc77b3e39d5705919e5c4c/test/distributed/test_c10d_functional_native.py#L114-L116

`test/distributed/_tensor/test_dtensor.py`
`test/distributed/_tensor/test_dtensor_compile.py`
`test/distributed/test_device_mesh.py`
`test/distributed/_tensor/experimental/test_tp_transform.py`
`test/distributed/_tensor/test_matrix_ops.py`
`test/distributed/test_inductor_collectives.py`
- All these tests were double running with both native funcol and legacy funcol. Changed to only run with native funcol.

`test/distributed/test_c10d_functional_native.py`
- Removed the `run_with_native_funcol` decorators.

Pull Request resolved: pytorch#123777
Approved by: https://github.com/wanchaol
ghstack dependencies: pytorch#123776
petrex pushed a commit to petrex/pytorch that referenced this pull request May 3, 2024
… funcol ops (pytorch#123777)

## Summary

After this PR, the functional collective Python APIs will stop honoring `TORCH_DISABLE_NATIVE_FUNCOL` and only use native funcol ops. Specifically, this PR:
- Removed `use_native_funcol()`.
- Removed the code path in the Python APIs when `use_native_funcol()` is `False`.
- Changed the CI tests that runs on both native funcol and legacy funcol through the Python API to only run with native funcol.

## Test Changes

`test_functional_api.py`
- Removed the tests where only one of output_split_sizes or input_split_sizes is specified. This behavior is unreliable has been removed from the native funcol.
- Removed `TestWaitiness` which tests an implementation detail of the legacy funcol. We have equivalent tests for native funcol in `test/distributed/test_c10d_functional_native.py` https://github.com/pytorch/pytorch/blob/b7fac76fc259394136bc77b3e39d5705919e5c4c/test/distributed/test_c10d_functional_native.py#L114-L116

`test/distributed/_tensor/test_dtensor.py`
`test/distributed/_tensor/test_dtensor_compile.py`
`test/distributed/test_device_mesh.py`
`test/distributed/_tensor/experimental/test_tp_transform.py`
`test/distributed/_tensor/test_matrix_ops.py`
`test/distributed/test_inductor_collectives.py`
- All these tests were double running with both native funcol and legacy funcol. Changed to only run with native funcol.

`test/distributed/test_c10d_functional_native.py`
- Removed the `run_with_native_funcol` decorators.

Pull Request resolved: pytorch#123777
Approved by: https://github.com/wanchaol
ghstack dependencies: pytorch#123776
@github-actions github-actions bot deleted the gh/yifuwang/78/head branch May 14, 2024 01:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ciflow/inductor ciflow/trunk Trigger trunk jobs on your pull request Merged oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (c10d) release notes category topic: not user facing topic category

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants