Dynamo - config gated torch.distributed allow, exclusion for special leaf funcs #110894
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/110894
Note: Links to docs will display an error until the docs builds have been completed.
✅ You can merge normally! (2 Unrelated Failures) As of commit 50bfd09 with merge base 1e7947b:
FLAKY - The following job failed but was likely due to flakiness present on trunk.
UNSTABLE - The following job failed but was likely due to flakiness present on trunk and has been marked as unstable.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
torch/_dynamo/skipfiles.py
Outdated
_module_dir(torch) + "_export/wrappers.py",
}

if torch.distributed.is_available():
config this too, I guess.
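For illustration, a minimal sketch of what gating this behind the flag could look like (using the config.trace_distributed flag added elsewhere in this PR; this is a hedged sketch, not the actual diff):

# Hedged sketch only: gate the distributed inline-list extension in
# torch/_dynamo/skipfiles.py behind the config flag as well.
if torch.distributed.is_available() and config.trace_distributed:
    FILENAME_INLINELIST |= set(
        glob.glob(_module_dir(torch) + "distributed/**/*.py", recursive=True),
    )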
Actually, I am not sure this matters; let's see if tests fail. The inline vs. allow refactor has made this a little more confusing.
I'm signing off in terms of functional correctness for the hack. However, @yanboliang should have the final say here, since this increases his workload for the refactor.
FILENAME_INLINELIST |= set(
    glob.glob(_module_dir(torch) + "distributed/**/*.py", recursive=True),
)
Can you add torch.distributed into the SUBMODULE_INLINELIST? That would be an easier way to force inlining of all files under a submodule.
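For reference, a minimal sketch of that suggestion, assuming SUBMODULE_INLINELIST in torch/_dynamo/skipfiles.py is a set of dotted module names whose files are force-inlined; the other entry shown is a placeholder, not the set's real contents:

SUBMODULE_INLINELIST = {
    "torch._refs",          # illustrative existing entry, not the real contents
    "torch.distributed",    # proposed addition: inline everything under it
}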
No, this fails for whatever reason.
If you want a crack at debugging:
- Check out voz/fsdp_autograd3
- Build
- Patch this in
- Run torchrun --standalone --nproc_per_node=2 fsdp.py
I think this would break trunk since you land after #110835. Do you mind sending a forward fix? Or I can help with the forward fix.
torch/_dynamo/allowed_functions.py
Outdated
# A subcheck of is_allowed, we utilize this for patching is_allowed around distributed.
# We do this because we want to allow these to be traced, and hence covered in skipfiles, but we do not want them to
# become TorchVariable
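To make the intent of that comment concrete, here is a hedged sketch of how such a subcheck might be consulted; the helper names and the lookup below are illustrative stand-ins, not the PR's actual code:

def is_allowed(obj) -> bool:
    # When distributed tracing is on, certain torch.distributed callables should
    # fail this check so they get traced (inlined or rewritten) instead of
    # becoming TorchVariable.
    if config.trace_distributed and _is_distributed_non_leaf(obj):  # hypothetical subcheck
        return False
    return _in_allowed_function_ids(obj)  # stand-in for the existing id-set lookup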
If you don't want to make them TorchVariable, we should not have is_allowed return True after my refactor. If you just want to inline these functions, can you treat them as regular Python functions and go through the regular inline rules?
We need some to go one way and some the other; the logic here is correct, and they shouldn't go through inline rules.
Do you expect them to be FX graph nodes? If yes, they should be wrapped as TorchVariable and is_allowed should return True. But that is not what is described in the comments.
The problem is that they are under torch.*, so they get routed to TorchVariable. For some, like

if obj in [
    torch.distributed._functional_collectives_impl._all_gather_into_tensor,
    torch.distributed._functional_collectives_impl._all_reduce,
    torch.distributed._functional_collectives_impl._reduce_scatter_tensor,
    torch.distributed._functional_collectives_impl._all_reduce_coalesced,
    torch.distributed._functional_collectives_impl._all_gather_into_tensor_coalesced,
    torch.distributed._functional_collectives_impl._reduce_scatter_tensor_coalesced,
]:

it is correct for them to become TorchVariable.

For the others (the rest of the checks), we need to make sure they do not become TorchVariable. Without this function, the wrong types would become TorchVariable instead of passing through CollectiveFunctionRewriteVariable.can_rewrite(value), etc.
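Put differently, here is a hedged sketch of the two routes described above; the set below copies the checks quoted in this thread, while the helper name and its use are illustrative rather than the PR's actual code:

import torch.distributed._functional_collectives_impl as fc_impl

_IMPL_LEAF_COLLECTIVES = {
    fc_impl._all_gather_into_tensor,
    fc_impl._all_reduce,
    fc_impl._reduce_scatter_tensor,
    fc_impl._all_reduce_coalesced,
    fc_impl._all_gather_into_tensor_coalesced,
    fc_impl._reduce_scatter_tensor_coalesced,
}

def wants_torch_variable(obj) -> bool:  # hypothetical helper
    # The low-level impl collectives are real graph leaves: TorchVariable is correct.
    if obj in _IMPL_LEAF_COLLECTIVES:
        return True
    # Everything else under torch.distributed should fail this check so the builder
    # can inline it or hand it to CollectiveFunctionRewriteVariable.can_rewrite(value).
    return False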
I don't quite understand Voz's explanation, but I certainly agree with Voz that there is something funny going on here.
To give an alternate example: on main, I'd like to inline into functions in _functional_collectives. I applied this patch:
diff --git a/torch/_dynamo/allowed_functions.py b/torch/_dynamo/allowed_functions.py
index 8beca2b4502..d155023ef54 100644
--- a/torch/_dynamo/allowed_functions.py
+++ b/torch/_dynamo/allowed_functions.py
@@ -175,6 +175,7 @@ def _allowed_function_ids():
# issues observed in
# https://github.com/pytorch/pytorch/issues/108269
"torch.distributed.algorithms.",
+ "torch.distributed._functional_collectives.",
)
allowed_modules_dot = tuple([x + "." for x in allowed_modules])
module = inspect.getmodule(obj)
But it does not work; for some reason I appear to still be trying to place things like all_to_all_single directly into the graph. Only with Voz's PR, plus my patch, plus turning on trace_distributed do I get the expected behavior of inlining.
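For reference, the flag being toggled here is the one this PR adds to torch/_dynamo/config.py; a minimal usage sketch, assuming the flag keeps this name and location:

import torch._dynamo.config as dynamo_config

# Opt in to tracing torch.distributed in full, so functional collectives are
# inlined or rewritten rather than placed in the graph as opaque calls.
dynamo_config.trace_distributed = True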
More of a meta question, but do we have a design doc for the new skipfiles/allowed_functions design? I am a little wary of all this complexity around distributed; hopefully we can do it more cleanly in a redesign.
@wconstab I have this doc to track the new skipfiles/allowed_function thing: https://docs.google.com/document/d/15gk0B-aLGfQTdffTcFbPzA3DLR1ZwrnJawmt_4kflOY/edit?userstoinvite=jansel@meta.com&sharingaction=manageaccess&role=writer . Feel free to comment and leave feedback.
torch/_dynamo/config.py
Outdated
[name for name, _ in inspect.getmembers(torch.Tensor) if re.match(r"^is_.*", name)]
)

trace_distributed = True
mistake
_find_torch_objects(torch)
_find_torch_objects(math)

if config.trace_distributed:
do you have to import torch.distribute._fun... here?
It seems like we do not.
@pytorchbot merge
@yanboliang, revert if it's not up to snuff, nbd nbd :)
Merge failed. Reason: This PR needs a label. If not, please add the label. To add a label, you can comment to pytorchbot. For more information, see the wiki. Details for Dev Infra team: raised by workflow job.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Stack from ghstack (oldest at bottom):
is_allowed is a tricky bit of functionality: it sits early in the builder and is used to drive the creation of TorchVariable (more notes here, meta only: https://fb.workplace.com/groups/pytorch.dev/permalink/1393563781222098/).

If we are tracing distributed in full, we want to route certain calls in distributed to NOT pass is_allowed (this does not, confusingly, mean that they are not allowed, but rather that we don't want them to become TorchVariable); others, we are fine with preserving.
cc @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @chenyang78 @aakhundov @kadeng