
[dynamo][moco] Disallow_in_graph distributed APIs #100071

Closed
wants to merge 6 commits

Conversation


pytorch-bot bot commented Apr 26, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/100071

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

❌ 3 New Failures

As of commit 16687a2:

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

anijain2305 added a commit that referenced this pull request Apr 26, 2023
ghstack-source-id: 66e2cebe7268d9f475a3b1a4ce5389c63dc8980d
Pull Request resolved: #100071
@anijain2305 added the topic: not user facing label Apr 26, 2023
@anijain2305 (Contributor Author)

@wconstab Wondering if you know of a better way here, instead of just graph breaking on them?

@wconstab (Contributor)

> Wondering if you know of a better way here, instead of just graph breaking on them?

No, we have to graph-break on them.

The missing piece is how to ensure we graph-break on the full set of them. I thought I stamped a PR (maybe from @yanboliang) a while back that skipped a large chunk of torch.distributed, but it looks like that did not land, or else I can't find remnants of it.

We should disable for all of the collectives besides the new 'traceable/functional' ones (in _functional_collectives.py).
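
A minimal sketch of the split being described, assuming an initialized default process group (the functional-collective signature shown is approximate):

```python
import torch
import torch.distributed as dist
import torch.distributed._functional_collectives as funcol

def legacy_style(x: torch.Tensor) -> torch.Tensor:
    # In-place, returns an optional work handle: not traceable, so
    # Dynamo should graph-break here and run it in eager.
    dist.all_reduce(x)
    return x

def functional_style(x: torch.Tensor) -> torch.Tensor:
    # Out-of-place, returns a new tensor: designed to be traceable,
    # so it can stay inside the compiled FX graph.
    return funcol.all_reduce(x, "sum", group=dist.group.WORLD)
```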

@yanboliang (Contributor)

> The missing piece is how to ensure we graph-break on the full set of them. I thought I stamped a PR (maybe from @yanboliang) a while back that skipped a large chunk of torch.distributed, but it looks like that did not land, or else I can't find remnants of it.

Yes, that PR has been reverted multiple times. I finally used regex matching to skip torchrec.distributed (to make it support torch.package), but it is only used internally behind is_fbcode. I think we should skip these files right now, not only because they are collective ops, but also because they have a memory leak issue due to their implementation.
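
A hypothetical sketch of such a regex-based skip (the pattern and helper name are illustrative, not the actual internal code):

```python
import re

# Illustrative only: match source files under torchrec/distributed so
# that Dynamo's skip logic refuses to trace into them.
_TORCHREC_DISTRIBUTED_RE = re.compile(r".*torchrec[/\\]distributed[/\\].*")

def should_skip_file(filename: str) -> bool:
    return bool(_TORCHREC_DISTRIBUTED_RE.match(filename))
```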

anijain2305 added a commit that referenced this pull request Apr 26, 2023
ghstack-source-id: 66e2cebe7268d9f475a3b1a4ce5389c63dc8980d
Pull Request resolved: #100071
anijain2305 added a commit that referenced this pull request Apr 26, 2023
ghstack-source-id: 25b947f582f7477f52b6d7d47e8b1d7785ba4d80
Pull Request resolved: #100071
```python
# There is no clear definition of torch.distributed ops. This helper set
# allows TorchDynamo to selectively disallow all the distributed ops from
# the FX graphs.
distributed_c10d_ops = set()
```
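
Roughly, the mechanism under review: the error-handling decorator that every collective already carries registers the op as a side effect. A simplified sketch (hypothetical shape, not the exact PR code):

```python
import functools

def exception_handler(func):
    # Side effect: every decorated collective is recorded, so TorchDynamo
    # can later disallow the whole set from its graphs.
    distributed_c10d_ops.add(func)

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception:
            # The real decorator enriches distributed error messages here.
            raise

    return wrapper
```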
@anijain2305 (Contributor Author)

@wconstab Is this too hacky? The problem is there is no easy way to tell what's a torch "op" and what's not.

Contributor

Maybe cc @H-Huang @kwen2501 -- do we expect a function to use the exception_handler decorator if and only if it is a c10d op?

Contributor

Hah, yeah, it's kinda hacky.

But OTOH I think it's reasonable to have a set of dynamo_unsupported_distributed_c10d_ops and explicitly add them all. We should just get folks like @H-Huang to agree on the set of ops to initially tag.

anijain2305 added a commit that referenced this pull request Apr 27, 2023
ghstack-source-id: 25b947f582f7477f52b6d7d47e8b1d7785ba4d80
Pull Request resolved: #100071
anijain2305 added a commit that referenced this pull request Apr 27, 2023
ghstack-source-id: d0c52747c5f5b618733411021aa8245a5cd7bbde
Pull Request resolved: #100071
anijain2305 added a commit that referenced this pull request Apr 27, 2023
ghstack-source-id: 0fe6edef674a88c22742b9fa57f453c3614c88ef
Pull Request resolved: #100071
anijain2305 added a commit that referenced this pull request Apr 27, 2023
ghstack-source-id: 0fe6edef674a88c22742b9fa57f453c3614c88ef
Pull Request resolved: #100071

```python
# These ops are not friendly to TorchDynamo, so we disallow them in the FX
# graph and let them run in eager mode instead under torch.compile.
dynamo_unsupported_distributed_c10d_ops = [
    # ... (the legacy collectives defined in distributed_c10d.py)
]
```
Member

I am missing the context here. Is the idea just to exclude all the operations defined in distributed_c10d.py?

@anijain2305 (Contributor Author)

Yes

@anijain2305 (Contributor Author)

Or, specifically, the operators that do not work well with TorchDynamo tracing, where we want to graph-break (i.e. fall back to eager) on them.
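
A sketch of how such a list can be consumed, via the public torch._dynamo.disallow_in_graph API (illustrative wiring; the PR may hook in differently):

```python
import torch._dynamo
from torch.distributed.distributed_c10d import (
    dynamo_unsupported_distributed_c10d_ops,
)

# Illustrative: disallowing each op forces a graph break at its call
# site, so the collective always runs in eager mode under torch.compile.
for op in dynamo_unsupported_distributed_c10d_ops:
    torch._dynamo.disallow_in_graph(op)
```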

@anijain2305 (Contributor Author)

@H-Huang ping

Member

I see. Then would it be possible to just get all operations of the module through something like:
https://stackoverflow.com/questions/139180/how-to-list-all-functions-in-a-module

Then we don't have to create this list in distributed_c10d and worry about keeping it up to date.
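
For reference, that enumeration would look roughly like this (a sketch of the linked answer's approach; as the next reply notes, it over-collects):

```python
import inspect
import torch.distributed.distributed_c10d as c10d

# Gather every function defined in distributed_c10d itself, skipping
# names re-exported from other modules. Plenty of helpers that are not
# really "operators" still slip through, which is the drawback below.
candidate_ops = [
    fn
    for _, fn in inspect.getmembers(c10d, inspect.isfunction)
    if fn.__module__ == c10d.__name__
]
```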

@anijain2305 (Contributor Author)

@H-Huang Yes, we tried dir(module), __dict__, etc., but that gives a large number of functions, and many of them are not really "operators".

Another way I tried earlier, in commit 8e6dee3, was to record the ops in exception_handler. The assumption was that anyone adding a new op would decorate it with the exception handler. But that looked a little hacky as well.

Member

I see. Thanks for clarifying. This looks like the route to go, then. Should init_process_group be included in this list? It is not a conventional "operator".

@anijain2305 (Contributor Author)

Thanks for pointing that out. Removed it from the list.

@H-Huang (Member) left a comment
LGTM

@anijain2305 (Contributor Author)

@pytorchbot merge

@pytorch-bot added the ciflow/trunk label May 2, 2023
@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot (Collaborator)

Merge failed

Reason: 1 mandatory check(s) failed. The first few are:

Dig deeper by viewing the failures on hud

Details for Dev Infra team: raised by workflow job.

Failing merge rule: Core Maintainers

@anijain2305 (Contributor Author)

@pytorchbot merge -f "unrelated CI error"

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.
