Multiple fixes for functional collectives. #95897

kumpera · 2023-03-02T19:05:26Z

_functional_collectives.py: Ensure we always wait on all collectives.
derivatives.yaml: Mark all_reduce as non-differentiable.
gen_variable_type.py: Add all_reduce to DONT_ENFORCE_TENSOR_IMPL_USE_COUNT.
common_dtensor.py: Replace dist.barrier with all_reduce.
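
For readers unfamiliar with the "always wait" idea in the first item, here is a minimal sketch of the general pattern: an asynchronous collective returns a handle, and the result must be waited on before anything reads it. This is not the code in `_functional_collectives.py`. `PendingResult` and `fake_all_reduce` are hypothetical names, and a thread pool stands in for the communication backend.

```python
from concurrent.futures import ThreadPoolExecutor


class PendingResult:
    """Hypothetical stand-in for the output of an async collective."""

    def __init__(self, future):
        self._future = future
        self._value = None
        self._waited = False

    @property
    def waited(self):
        # True only after wait() has been called at least once.
        return self._waited

    def wait(self):
        # Block until the background work finishes; safe to call repeatedly.
        if not self._waited:
            self._value = self._future.result()
            self._waited = True
        return self._value

    def __add__(self, other):
        # Any use of the value goes through wait() first, so a caller
        # cannot observe a result that is still in flight.
        return self.wait() + other


def fake_all_reduce(values, executor):
    """Hypothetical collective: sums `values` on a worker thread."""
    return PendingResult(executor.submit(sum, values))


with ThreadPoolExecutor(max_workers=1) as pool:
    result = fake_all_reduce([1, 2, 3], pool)
    was_waited_before_use = result.waited  # no one has waited yet
    total = result + 1                     # first use triggers the wait
    was_waited_after_use = result.waited
```

The design point is that the wait is tied to the first use of the value rather than left to the caller, which is one way to make sure no collective is left un-waited.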

pytorch-bot · 2023-03-02T19:05:29Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/95897

❌ 1 Failures

As of commit 82e9af2:

BROKEN TRUNK - The following jobs failed but were present on the merge base 004bcff:

👉 Rebase onto the `viable/strict` branch to avoid these failures

linux-focal-cpu-py3.8-gcc7-inductor / test (inductor_timm_cpu_accuracy, 2, 2, linux.4xlarge) (gh)

This comment was automatically generated by Dr. CI and updates every 15 minutes.

facebook-github-bot · 2023-03-02T19:12:59Z

@kumpera has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot · 2023-03-02T22:47:47Z

@kumpera has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

wconstab · 2023-03-03T00:09:15Z

Oops, @wanchao already landed reduce_scatter, so that op also needs the derivatives.yaml and codegen fix.

You can do it in a separate PR if it's easier to land this first.

fegin

Thanks for the fix!

kumpera · 2023-03-03T15:50:48Z

@pytorchmergebot merge

pytorchmergebot · 2023-03-03T15:52:37Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

pytorchmergebot · 2023-03-03T15:52:40Z

Merge failed

Reason: 43 mandatory check(s) failed. The first few are:

Failing merge rule: Core Maintainers

facebook-github-bot · 2023-03-03T23:06:20Z

@kumpera has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

kumpera · 2023-03-06T15:33:16Z

@pytorchmergebot merge -f "the inductor failure is unrelated"

pytorchmergebot · 2023-03-06T15:34:59Z

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes).

_functional_collectives.py: Ensure we always wait all collectives.
derivatives.yaml: mark all_reduce as non differentiable
gen_variable_type.py: Add all_reduce to DONT_ENFORCE_TENSOR_IMPL_USE_COUNT
common_dtensor.py: replace dist.barrier with all_reduce

Pull Request resolved: pytorch/pytorch#95897
Approved by: https://github.com/wconstab, https://github.com/fegin

…arts of the codebase (#96460)

Recent master breakage on focal and bionic PTD tests since we switched to all_reduce in #95897

Pull Request resolved: #96460
Approved by: https://github.com/fegin

kumpera requested review from H-Huang, albanD, awgu, fegin, kwen2501, mrshenli, rohan-varma, wanchaol and zhaojuanmao as code owners March 2, 2023 19:05

kumpera requested a review from soulitzer as a code owner March 2, 2023 19:05

wconstab reviewed Mar 2, 2023

tools/autograd/gen_variable_type.py (thread resolved)

wconstab reviewed Mar 2, 2023

torch/distributed/_functional_collectives.py (thread resolved)

wconstab approved these changes Mar 2, 2023

fegin added the ciflow/trunk label (Trigger trunk jobs on your pull request) Mar 3, 2023

fegin approved these changes Mar 3, 2023

kumpera added the topic: bug fixes (topic category), module: dtensor (distributed tensor tag), and release notes: distributed (dtensor) (release notes category) labels Mar 3, 2023

Rodrigo Kumpera added 2 commits March 3, 2023 20:08

add other ops to the fix.

48a09e6

kumpera force-pushed the fix_stuff branch from 918260e to 48a09e6 Compare March 3, 2023 20:09

Update grad failure message.

82e9af2

github-actions bot added the ciflow/inductor label Mar 3, 2023

pytorchmergebot added the Merged label Mar 6, 2023

pytorchmergebot closed this in 5b2ab0d Mar 6, 2023

ZainRizvi mentioned this pull request Mar 9, 2023

DISABLED test_dtensor_op_db_linalg_diagonal_cpu_float32 (__main__.TestDTensorOpsCPU) #96454

Closed

kumpera mentioned this pull request Mar 9, 2023

Revert all_reduce workaround as it might be causing issues on other parts of the codebase #96460

Closed
