
Inductor support for aten::all_reduce #93111

Closed

wants to merge 41 commits

Conversation

@pytorch-bot bot commented on Jan 26, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/93111

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 Failure

As of commit 0a61bca:

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

Resolved review threads (outdated): torch/_inductor/ir.py (3), torch/_inductor/scheduler.py (1)
cc mlazos soumith voznesenskym yanboliang penguinwu anijain2305 EikanWang jgong5 Guobing-Chen chunyuan-w XiaobingSuper zhuhaozhe blzheng Xia-Weiwen wenzhe-nrv jiayisunx peterbell10 desertfire

[ghstack-poisoned]
Resolved review thread: torch/_inductor/ir.py
@wconstab requested a review from fegin as a code owner on February 13, 2023 at 21:58
wconstab added a commit that referenced this pull request Feb 15, 2023
ghstack-source-id: 0c99a6a803ab2cd9fd71242b8b7dd3d191e0c3eb
Pull Request resolved: #93111
@wanchaol (Contributor) left a comment:

LGTM, just some questions as I'm learning how Inductor works, and this looks like a great example!



@requires_nccl()
class TestCollectivesInductor(DynamoDistributedSingleProcTestCase):
Reviewer (Contributor):

Pretty interested in how the single-process test case works for collectives. Is it doing an all_reduce on a single rank?

Author (@wconstab):

It is calling the all_reduce op, but the NCCL kernel may effectively be skipped.

These tests only measure whether Inductor generates the right code to call into the dist.* APIs; I assume the APIs themselves work as intended.

Above, there is one 'real' integration test that runs multi-process.
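
For readers following the thread, here is a minimal sketch of the kind of single-process codegen check being described; the helper run_and_get_triton_code and the asserted substring are assumptions for illustration, not the PR's actual test:

# Sketch only: checks that Inductor emits wrapper code calling into dist.* APIs,
# without asserting anything about the numerical result of the collective.
import torch
from torch._inductor.utils import run_and_get_triton_code  # assumed helper

def func(inp, *, tag, ranks, group_size):
    # Same temporary aten op used elsewhere in this PR's tests.
    return torch.ops.aten.all_reduce(inp, "sum", tag, ranks, group_size)

def test_codegen_calls_dist_api(self):
    inp = torch.ones(4, 4, device="cuda")
    compiled = torch.compile(func)
    code = run_and_get_triton_code(compiled, inp, **self.get_world_trs())
    # The assertion targets the generated wrapper source, not kernel output.
    self.assertIn("all_reduce", code)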

with self.assertRaisesRegex(RuntimeError, "derivative for aten::all_reduce is not implemented"):
    compiled = torch.compile(func, backend="aot_eager")  # inductor bug with single-op allreduce graph
    out = compiled(input, **self.get_world_trs())
    out.sum().backward()
Reviewer (Contributor):

Oh, I thought we hadn't implemented the all_reduce backward yet, so it's a dummy function right now and we're just testing whether that dummy function works here?

Author (@wconstab):

Yeah, I should delete this test. I asked Rodrigo to cover this in his own test file, and he did. Also, he dropped backward support for now, so I think his test is a stub; we will add it later.

)

    def codegen(self, wrapper):
        wrapper.add_import_once("import torch.distributed as dist")
Reviewer (Contributor):

So right now it generates Triton Python code, I suppose. Would it be possible to generate a C++ kernel in the future?

Author (@wconstab):

None of the Python code I'm generating here is Triton code. Triton is generated one layer deeper in Inductor, when it does a 'fusion' of some ops and then codegens a kernel. The code here goes into the 'top-level wrapper' script Inductor generates, which is what calls the generated Triton kernels and also calls other eager ops, allocations, etc.

The Python wrapper code can also be changed to C++, and that's part of the 'AOT Inductor' workstream.
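
To make that distinction concrete, here is a rough, hypothetical excerpt of the kind of wrapper-level Python this codegen path produces; the variable names, the in-place handling, and the exact dist.all_reduce call are assumptions pieced together from the snippets quoted in this review, not literal Inductor output:

# Hypothetical excerpt of Inductor's top-level wrapper script (not Triton code).
import torch.distributed as dist
from torch.distributed._functional_collectives import _str_to_reduce_op
from torch.distributed.distributed_c10d import _find_or_create_pg_by_ranks_and_tag

def call(args):
    buf0, = args  # input buffer produced earlier in the wrapper
    # Resolve the process group from (tag, ranks, group_size), mirroring the
    # wrapper.writeline call discussed later in this review.
    buf1_pg = _find_or_create_pg_by_ranks_and_tag('', [0, 1], 2)
    buf1 = buf0  # illustrative: the collective reduces into this buffer
    # Issue the collective through the eager dist.* API.
    work = dist.all_reduce(buf1, op=_str_to_reduce_op('sum'), group=buf1_pg, async_op=True)
    work.wait()
    return (buf1,)

The point being made above is that all of this is ordinary eager Python calling dist.*; only fused compute kernels, one layer deeper, are emitted as Triton.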


    def test_dynamo_trace_allreduce(self):
        def func(inp, *, tag, ranks, group_size):
            ar = torch.ops.aten.all_reduce(inp, "sum", tag, ranks, group_size)
Reviewer (Contributor):

Dynamo works here because we are calling the aten op rather than the functional collective directly, so we get around the AsyncTensor subclass?

Author (@wconstab):

Oh, this is not the real Dynamo support. See a later PR in this stack: I change this test to call the real collective and make changes to Dynamo to fix it.
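
For context, a hypothetical sketch of what the test might call once Dynamo traces the functional collective directly; the module path and the all_reduce signature shown are assumptions about torch.distributed._functional_collectives, not code from this PR:

import torch
import torch.distributed._functional_collectives as fcol  # assumed module path

def func(inp, *, tag, ranks, group_size):
    # Call the functional collective instead of the temporary aten op; the later
    # PR in this stack teaches Dynamo to trace this (and its AsyncTensor result).
    ar = fcol.all_reduce(inp, "sum", ranks, tag)  # signature assumed for illustration
    return ar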

"from torch.distributed._functional_collectives import _str_to_reduce_op"
)
wrapper.add_import_once(
"from torch.distributed.distributed_c10d import _find_or_create_pg_by_ranks_and_tag"
Reviewer (Contributor):

I find using c10d internals a bit problematic, but we can iterate on this later.

Author (@wconstab):

Happy to iterate, but you'll have to be more specific about the problem :)


        # TODO: avoid more than one ref of the same pg (even though they are cached inside the api)
        wrapper.writeline(
            f"{output_name}_pg = _find_or_create_pg_by_ranks_and_tag('{tag}', {ranks}, {group_size})"
Reviewer (Contributor):

This should be cached across invocations.

Author (@wconstab):

Yeah, currently we will construct more than one object for what is really the same pg (and call _find_or_create more than once for the same pg).

Is this a serious problem at all? I assumed it is 'safe' but also not ideal; my TODO above was framed as a cleanup for later. But if you see a more serious issue, let me know.
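
One shape that cleanup could take, as a sketch only; the cache name and placement are assumptions, not what this PR implements:

from torch.distributed.distributed_c10d import _find_or_create_pg_by_ranks_and_tag

# Hypothetical process-level cache so the generated wrapper resolves each
# (tag, ranks, group_size) triple at most once, instead of once per reference.
_pg_cache = {}

def _get_pg_cached(tag, ranks, group_size):
    key = (tag, tuple(ranks), group_size)
    if key not in _pg_cache:
        _pg_cache[key] = _find_or_create_pg_by_ranks_and_tag(tag, ranks, group_size)
    return _pg_cache[key]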

@@ -313,6 +325,25 @@ def debug_str_extra(self):
    def is_extern(self):
        return True

    def can_inplace(self, read_dep: dependencies.MemoryDep):
Reviewer:

Maybe I'm missing something, but where is this function used?

Author (@wconstab):

It's an 'interface' function that already exists in Inductor's scheduler; I'm just defining specific behavior for this subclass.
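
As a hypothetical illustration of that pattern (the class name and the decision to return False are assumptions; the PR's actual override may behave differently):

from torch._inductor import dependencies

class ExternCollectiveSchedulerNode:
    """Sketch of a node class filling in existing scheduler interface hooks."""

    def is_extern(self):
        # Extern nodes are invoked from the generated wrapper rather than fused
        # into Triton kernels.
        return True

    def can_inplace(self, read_dep: dependencies.MemoryDep) -> bool:
        # Illustrative policy: keep the collective's buffer out of in-place
        # reuse so the scheduler never aliases it with another op's output.
        return False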

@wconstab (Contributor, Author) commented:

@pytorchbot merge

pytorch-bot added the ciflow/trunk label (Trigger trunk jobs on your pull request) on Feb 16, 2023
@pytorchmergebot (Collaborator) commented:

Merge started

Your change will be merged once all checks pass (ETA 0-4 hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team.

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot (Collaborator) commented:

Merge failed

Reason: 2 jobs have failed; the first few of them are: trunk / macos-12-py3-arm64 / test (default, 1, 2, macos-m1-12), trunk / macos-12-py3-arm64 / test (default, 2, 2, macos-m1-12)

Details for Dev Infra team: raised by workflow job.

@wconstab (Contributor, Author) commented:

pytorchbot merge -f "Flaky CI (unable to download huggingface model)"

@wconstab (Contributor, Author) commented:

@pytorchbot merge -f "Flaky CI (unable to download huggingface model)"

@pytorchmergebot (Collaborator) commented:

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team.

Advanced Debugging: check the merge workflow status here.

pruthvistony added a commit to ROCm/pytorch that referenced this pull request May 2, 2023
@facebook-github-bot deleted the gh/wconstab/79/head branch on June 8, 2023 at 19:19
jhavukainen pushed a commit to kulinseth/pytorch that referenced this pull request Mar 15, 2024
6 participants