
[tp] fix torch compile regression #111521

Closed
wants to merge 3 commits

Conversation

wanchaol
Contributor

@wanchaol commented Oct 18, 2023

Stack from ghstack (oldest at bottom):

The most recent refactor of TP
(#111160) breaks the torch.compile
path, so revert the behavior by:

  1. using the old default prepare_input/output
  2. adding colwise/rowwise parallel tests instead

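For context, the colwise/rowwise styles exercised by the new tests are configured through a tensor-parallel plan passed to `parallelize_module`. A minimal sketch of such a plan, assuming the `torch.distributed.tensor.parallel` API; the submodule names `net1`/`net2` and the `device_mesh` variable are hypothetical:

```python
from torch.distributed.tensor.parallel import (
    ColwiseParallel,
    RowwiseParallel,
    parallelize_module,
)

# Hypothetical two-layer MLP: shard the first linear column-wise and the
# second row-wise, so the intermediate activation stays sharded and only
# one collective is needed to produce the replicated output.
tp_plan = {
    "net1": ColwiseParallel(),
    "net2": RowwiseParallel(),
}

# device_mesh is assumed to be built elsewhere (e.g. via init_device_mesh)
# under a multi-process launch such as torchrun; parallelize_module applies
# the plan in place:
# model = parallelize_module(model, device_mesh, tp_plan)
# compiled = torch.compile(model)  # the code path this PR un-breaks
```

This is only a configuration sketch, not the PR's test code itself.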

[ghstack-poisoned]
@pytorch-bot

pytorch-bot bot commented Oct 18, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/111521

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (3 Unrelated Failures)

As of commit 7f29dec with merge base 5c39552:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

UNSTABLE - The following job failed but was likely due to flakiness present on trunk and has been marked as unstable:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

wanchaol added a commit that referenced this pull request Oct 18, 2023

ghstack-source-id: 4b9faae80116d25f5525b99e651985e14691aeab
Pull Request resolved: #111521
Comment on lines 485 to 486
_prepare_input=None,
_prepare_output=None,
Contributor

But doesn't this somehow revert input_layouts and output_layouts? Does this mean that when users want to use input_layouts and output_layouts, they have to set _prepare_input to None?

Contributor Author

This is just a quick fix; I'll do a bigger rewrite on top to make sure user behavior won't change.

Contributor

I see.

Contributor Author

OK, I gave up on the big rewrite and found a quicker solution for now. However, it does not fundamentally fix the torch.compile problem in the new-style input_layouts and output_layouts path (i.e., torch.compile still fails when configuring sequence parallelism). We should do a full rewrite soon to make the TP code cleaner and tracing-friendly.

Contributor Author

This "hack" preserves the intended behavior, but I dislike it.

wanchaol added a commit that referenced this pull request Oct 19, 2023

ghstack-source-id: 0865d507b2debbd4bfa708a0ff7465a23e64f5de
Pull Request resolved: #111521
@wanchaol added the ciflow/trunk and release notes: distributed (dtensor) labels on Oct 19, 2023
@wanchaol
Contributor Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@wanchaol added the ciflow/periodic label (Trigger jobs ran periodically on master (periodic.yml)) on Oct 19, 2023
@pytorchmergebot
Collaborator

Merge failed

Reason: 1 mandatory check(s) failed. The first few are:

Dig deeper by viewing the failures on hud

Details for Dev Infra team Raised by workflow job

Failing merge rule: Core Maintainers

wanchaol added a commit that referenced this pull request Oct 19, 2023

ghstack-source-id: 2dbb05c7bb3012e67af3e07778032ed1b52cc69d
Pull Request resolved: #111521
@wanchaol
Contributor Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).


@facebook-github-bot facebook-github-bot deleted the gh/wanchaol/376/head branch October 22, 2023 14:23
xuhancn pushed a commit to xuhancn/pytorch that referenced this pull request Nov 7, 2023
Pull Request resolved: pytorch#111521
Approved by: https://github.com/fduwjj
Labels
ciflow/periodic · ciflow/trunk · Merged · release notes: distributed (dtensor)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants