[inductor] decomposition for complex addition #110740
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/110740
Note: Links to docs will display an error until the docs builds have been completed.
✅ You can merge normally! (1 unrelated failure) As of commit 4ca773b with merge base 0a26e5f: BROKEN TRUNK - the following job failed but was also present on the merge base. 👉 Rebase onto the `viable/strict` branch to avoid these failures.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@jansel Can you please take a look? There are some test failures that seem unrelated and look like an infra issue.
torch/_inductor/decomposition.py
Outdated
r = x.real + r
if x_is_complex_tensor:
    i = x.imag + i
complex_type = x.dtype if x_is_complex_tensor else y.dtype
torch.promote_types
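For illustration, a minimal sketch of this suggestion (my example, not the PR's code): torch.promote_types derives the common result dtype from both operands, which also covers mixed-precision cases that picking one side's dtype would get wrong.

```
import torch

# Sketch: let type promotion pick the result dtype instead of taking
# whichever operand happens to be complex.
x = torch.randn(3, dtype=torch.complex64)
y = torch.randn(3, dtype=torch.float64)
complex_type = torch.promote_types(x.dtype, y.dtype)
print(complex_type)  # torch.complex128, not x.dtype (torch.complex64)
```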
torch/_inductor/decomposition.py
Outdated
return (
    torch.where(
        torch.arange(2, device=x.device, dtype=torch.uint8) == 0,
        r.unsqueeze(-1),
        i.unsqueeze(-1),
    )
    .view(complex_type)
    .squeeze(-1)
)
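For readers following along, here is a self-contained illustration (values mine) of the interleave-and-view trick in the diff above:

```
import torch

# The broadcast where picks r at position 0 and i at position 1 of the new
# trailing dim, producing interleaved (real, imag) float pairs; viewing the
# pairs as complex64 then merges each pair into one complex element.
r = torch.tensor([1.0, 3.0])
i = torch.tensor([2.0, 4.0])
pairs = torch.where(torch.arange(2) == 0, r.unsqueeze(-1), i.unsqueeze(-1))
print(pairs)  # tensor([[1., 2.], [3., 4.]])
z = pairs.view(torch.complex64).squeeze(-1)
print(z)  # tensor([1.+2.j, 3.+4.j])
```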
For add in particular I think you can just do:
return (x.view(torch.float32) + y.view(torch.float32)).view(complex_type)
Since you do the same thing for both .real and .imag.
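A quick standalone check of this trick (my verification, assuming the intended x + y): complex64 storage is interleaved float32 pairs, and addition acts elementwise on them, so the float-view add is bitwise-identical to complex addition.

```
import torch

x = torch.randn(4, dtype=torch.complex64)
y = torch.randn(4, dtype=torch.complex64)
# Reinterpret the interleaved (real, imag) floats, add, reinterpret back.
out = (x.view(torch.float32) + y.view(torch.float32)).view(torch.complex64)
assert torch.equal(out, x + y)
```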
This is cool. Thanks for the suggestion. However, the generated code still has aten.view calls, and CSE cannot be done across kernels. I guess the goal here is to optimize away the aten.view calls?
with torch.cuda._DeviceGuard(0):
    torch.cuda.set_device(0)  # no-op to ensure context
    # Source Nodes: [Z], Original ATen: [aten.add]
    buf0 = aten.view(arg0_1, torch.float32)
    del arg0_1
    buf1 = buf0
    del buf0
    # Source Nodes: [Z], Original ATen: [aten.add]
    buf2 = aten.view(arg1_1, torch.float32)
    del arg1_1
    buf3 = buf2
    del buf2
    buf4 = buf1; del buf1  # reuse
    # Source Nodes: [Z], Original ATen: [aten.add]
    stream0 = get_cuda_stream(0)
    triton_poi_fused_add_0.run(buf4, buf3, 2000000, grid=grid(2000000), stream=stream0)
    del buf3
    # Source Nodes: [Z], Original ATen: [aten.add]
    buf5 = aten.view(buf4, torch.complex64)
    del buf4
    buf6 = buf5
    del buf5
    return (buf6, )
Also, the new form does not handle the case where either `x` or `y` is a scalar.
Yeah, fixing the view problems requires teaching inductor how to do complex views.
The scalar case would need to be handled differently. You could first convert the scalar to complex.
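A minimal sketch of that scalar path (names are illustrative, not from the PR):

```
import torch

def promote_scalar(s, like):
    # Turn a Python number into a 0-d tensor of the complex operand's dtype,
    # after which the tensor-tensor decomposition applies unchanged.
    return torch.tensor(s, dtype=like.dtype, device=like.device)

x = torch.randn(4, dtype=torch.complex64)
y = promote_scalar(2.5, x)
assert torch.equal(x + y, x + 2.5)
```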
@jansel I think an easier thing to do would just be to pattern-match away redundant view calls, right?
Where would you do that? BTW, are they redundant? I thought without the view calls, the type of the tensors wouldn't be correct.
In this case, they're not redundant. But if you have two adds in a row, the views in the way would prevent fusion.
So for a single complex add, it would be:
x.view(torch.float32)
x + x
x.view(torch.complex64)
Then for two complex adds, it would be:
x.view(torch.float32)
x + x
x.view(torch.complex64)
x.view(torch.float32)
x + x
x.view(torch.complex64)
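To make the redundancy concrete (my demo): the complex64 view followed immediately by the float32 view is an identity on the underlying buffer, which is exactly what a pattern match could cancel so the two adds fuse.

```
import torch

t = torch.randn(8)  # stands in for the float result of the first add
u = t.view(torch.complex64).view(torch.float32)
# Same storage, same values: the round trip is pure metadata.
assert u.data_ptr() == t.data_ptr()
assert torch.equal(u, t)
```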
There is the additional problem that these views are getting mapped to fallback kernels, which don't properly encode the aliasing relationship. So inductor may try to reuse the memory for something else, which is a correctness issue.
I'm working on a new inductor IR node that handles complex/float views to avoid calling into the runtime. From recent discussion, it looks like the Triton community doesn't want to support complex natively there. But do you see a way to handle the views without using a Triton kernel? Once the views are lowered, inductor will fuse the lowered code into the same Triton kernel, which exposes complex64 to Triton.
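For context on handling views without a kernel (my illustration, not the PR's design): a dtype view never touches data; it only reinterprets the same storage with a different element size, so in principle it can be resolved entirely at buffer-metadata level.

```
import torch

x = torch.randn(4, dtype=torch.complex64)
f = x.view(torch.float32)
assert f.data_ptr() == x.data_ptr()  # same storage, no copy
assert f.shape == (8,)  # twice the elements at half the element size
```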
Awesome! I'm also looking for an optimal methodology to support complex numbers. @htyu, may I ask about the performance behavior? I'm kind of concerned about performance on the Triton side, as non-contiguous access may not be vectorized easily.
Optimal performance is our goal too in general. For this particular issue, since there are no strided memory accesses created, the performance is actually on par with the aten fallback path. Non-contiguous access will be an issue for other complex-number operations such as matmul. There is a discussion about whether to support complex natively in Triton, which might handle non-contiguous access more cleverly with packing/unpacking: https://triton-lang.slack.com/archives/C01LY4FJL56/p1697060717465109. You are welcome to discuss there. My understanding is that it's not easy to come up with a scheme that is optimal for both cuda cores and tensor cores.
@@ -611,6 +611,15 @@ def fn(x, y):

        self.common(fn, (x, y))

    def test_add_complex(self):
Add tests for (a sketch follows the list):
- complex + non-complex
- complex + scalar
- alpha=...
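A sketch of what the requested tests might look like, reusing the suite's existing self.common helper and assuming the test module's imports; shapes, values, and test names are illustrative only:

```
def test_add_complex_mixed(self):
    def fn(a, b):
        return a + b, torch.add(a, b, alpha=2)

    x = torch.tensor([1 + 1j, -1 + 1j, -2 + 2j, 3 - 3j, 0, 1j])
    y = torch.randn(6)  # complex + non-complex
    self.common(fn, (x, y))

def test_add_complex_scalar(self):
    def fn(a):
        return a + 1.5, torch.add(a, 2.0, alpha=0.5)

    x = torch.randn(8, dtype=torch.complex64)  # complex + scalar, with alpha
    self.common(fn, (x,))
```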
Also, if you have a test that demonstrates the aliasing issue with the prior version, you should add that too.
Added a test for alpha.
For #1 and #2, I'm afraid that a strided load will be needed, as the computation has nothing to do with the imaginary part. Alternatively, we could load the complex number as a whole in a thread, but for complex128 we would need int128 support, which isn't there in Triton. Can we leave them to a separate change until downstream support is ready?

> Also, if you have a test that demonstrates the aliasing issue with the prior version, you should add that too.

Somehow I couldn't repro this anymore :( Could it be due to the recent `has_aliasing` change in #110651?
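To illustrate the "load as a whole" idea (my example): one complex64 element is 8 bytes, the same width as an int64, so a single wide load can fetch both halves at once; complex128 would need a 16-byte load, hence the int128 gap.

```
import torch

x = torch.randn(4, dtype=torch.complex64)
packed = x.view(torch.int64)  # each int64 holds one (real, imag) float pair
assert packed.shape == (4,)
assert packed.data_ptr() == x.data_ptr()
```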
I don't think this is the correct approach. How would you implement things like multiplication or abs? If you do it naïvely, you'll end up with 2 non-contiguous loads, which will give you pretty horrific performance, and as it stands, it's not possible to do better without introducing quite a few structural changes to how inductor loads values.
This is an intrinsic limitation of this approach, as addition is pretty much the only operation you can implement without having to split the real and imaginary parts.
I think that, if we want to support complex numbers, we should look into helping add complex number support within Triton, and into supporting complex numbers natively in inductor.
After reading #98161 (comment), it looks like Triton does not want to support complex numbers, but we would still need a way to load complex tensors efficiently, and a way to implement performant lowerings for complex numbers in Triton, which may be different from those on CPU, as CPU already has all these ops implemented for the scalar and vectorized dtypes.
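To make the objection concrete, here is a naive multiplication decomposition (my sketch, not proposed code): .real and .imag are strided views into the interleaved storage, so each operand incurs two non-contiguous loads.

```
import torch

def complex_mul_decomposed(x, y):
    a, b = x.real, x.imag  # stride-2 views over the underlying floats
    c, d = y.real, y.imag
    # (a + bi)(c + di) = (ac - bd) + (ad + bc)i
    return torch.complex(a * c - b * d, a * d + b * c)

x = torch.randn(4, dtype=torch.complex64)
y = torch.randn(4, dtype=torch.complex64)
assert torch.allclose(complex_mul_decomposed(x, y), x * y)
```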
@pytorchbot label "topic: not user facing"
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: 2 mandatory check(s) failed. The first few are:
Dig deeper by viewing the failures on hud.
@pytorchbot merge -f "bypass unrelated failure"
Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
As a follow-up to #110740, this patch enables removing redundant complex views to allow more operation fusion. E.g., given

```
@torch.compile
def foo(X, Y):
    Z = X + Y
    A = X + Y
    return A + Z
```

the generated code is:

```
@triton.jit
def triton_(in_ptr0, in_ptr1, out_ptr0, xnumel, XBLOCK : tl.constexpr):
    xnumel = 6
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:]
    xmask = xindex < xnumel
    x0 = xindex
    tmp0 = tl.load(in_ptr0 + (x0), xmask)
    tmp1 = tl.load(in_ptr1 + (x0), xmask)
    tmp2 = tmp0 + tmp1
    tmp3 = tmp2 + tmp2
    tl.store(out_ptr0 + (x0), tmp3, xmask)

def call(args):
    arg0_1, arg1_1 = args
    args.clear()
    assert_size_stride(arg0_1, (3, ), (1, ))
    assert_size_stride(arg1_1, (3, ), (1, ))
    with torch.cuda._DeviceGuard(0):
        torch.cuda.set_device(0)  # no-op to ensure context
        # Source Nodes: [A], Original ATen: [aten.add]
        buf0 = aten.view.dtype(arg0_1, torch.float32)
        del arg0_1
        buf1 = buf0
        del buf0
        # Source Nodes: [A], Original ATen: [aten.add]
        buf2 = aten.view.dtype(arg1_1, torch.float32)
        del arg1_1
        buf3 = buf2
        del buf2
        buf4 = empty_strided((6, ), (1, ), device='cuda', dtype=torch.float32)
        # Source Nodes: [add_2], Original ATen: [aten.add]
        stream0 = get_cuda_stream(0)
        triton_poi_fused_add_0.run(buf1, buf3, buf4, 6, grid=grid(6), stream=stream0)
        del buf1
        del buf3
        # Source Nodes: [add_2], Original ATen: [aten.add]
        buf5 = aten.view.dtype(buf4, torch.complex64)
        del buf4
        buf6 = buf5
        del buf5
        return (buf6, )
```

whereas previously the generated code was:

```
@triton.jit
def triton_(in_ptr0, in_ptr1, out_ptr0, xnumel, XBLOCK : tl.constexpr):
    xnumel = 6
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:]
    xmask = xindex < xnumel
    x0 = xindex
    tmp0 = tl.load(in_ptr0 + (x0), xmask)
    tmp1 = tl.load(in_ptr1 + (x0), xmask)
    tmp2 = tmp0 + tmp1
    tl.store(out_ptr0 + (x0), tmp2, xmask)

def call(args):
    arg0_1, arg1_1 = args
    args.clear()
    assert_size_stride(arg0_1, (3, ), (1, ))
    assert_size_stride(arg1_1, (3, ), (1, ))
    with torch.cuda._DeviceGuard(0):
        torch.cuda.set_device(0)  # no-op to ensure context
        # Source Nodes: [A], Original ATen: [aten.add]
        buf0 = aten.view.dtype(arg0_1, torch.float32)
        buf1 = buf0
        del buf0
        # Source Nodes: [A], Original ATen: [aten.add]
        buf2 = aten.view.dtype(arg1_1, torch.float32)
        buf3 = buf2
        del buf2
        buf4 = empty_strided((6, ), (1, ), device='cuda', dtype=torch.float32)
        # Source Nodes: [A], Original ATen: [aten.add]
        stream0 = get_cuda_stream(0)
        triton_poi_fused_add_0.run(buf1, buf3, buf4, 6, grid=grid(6), stream=stream0)
        del buf1
        del buf3
        # Source Nodes: [A], Original ATen: [aten.add]
        buf5 = aten.view.dtype(buf4, torch.complex64)
        buf6 = buf5
        del buf5
        # Source Nodes: [add_2], Original ATen: [aten.add]
        buf7 = aten.view.dtype(buf6, torch.float32)
        del buf6
        buf8 = buf7
        del buf7
        # Source Nodes: [Z], Original ATen: [aten.add]
        buf9 = aten.view.dtype(arg0_1, torch.float32)
        del arg0_1
        buf10 = buf9
        del buf9
        # Source Nodes: [Z], Original ATen: [aten.add]
        buf11 = aten.view.dtype(arg1_1, torch.float32)
        del arg1_1
        buf12 = buf11
        del buf11
        buf13 = buf4; del buf4  # reuse
        # Source Nodes: [Z], Original ATen: [aten.add]
        triton_poi_fused_add_0.run(buf10, buf12, buf13, 6, grid=grid(6), stream=stream0)
        del buf10
        del buf12
        # Source Nodes: [Z], Original ATen: [aten.add]
        buf14 = aten.view.dtype(buf13, torch.complex64)
        buf15 = buf14
        del buf14
        # Source Nodes: [add_2], Original ATen: [aten.add]
        buf16 = aten.view.dtype(buf15, torch.float32)
        del buf15
        buf17 = buf16
        del buf16
        buf18 = buf13; del buf13  # reuse
        # Source Nodes: [add_2], Original ATen: [aten.add]
        triton_poi_fused_add_0.run(buf8, buf17, buf18, 6, grid=grid(6), stream=stream0)
        del buf17
        del buf8
        # Source Nodes: [add_2], Original ATen: [aten.add]
        buf19 = aten.view.dtype(buf18, torch.complex64)
        del buf18
        buf20 = buf19
        del buf19
        return (buf20, )
```

Pull Request resolved: #111773
Approved by: https://github.com/jansel
Tracks #98161.
Complex number support in PyTorch isn't ideal today, as complex operations will mostly end up taken care of by the aten runtime, except for `torch.angle`, which is handled in #105609. In general, a better way to handle this could be to decompose complex operations first so that more opportunities for fusion can be unveiled, and then to have Triton take care of non-contiguous (strided) tensor operations more efficiently. This change adds support for decomposing complex additions.

```
@triton.jit
def triton_(in_ptr0, in_ptr1, out_ptr0, xnumel, XBLOCK : tl.constexpr):
    xnumel = 6
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:]
    xmask = xindex < xnumel
    x0 = xindex
    tmp0 = tl.load(in_ptr0 + (x0), xmask)
    tmp1 = tl.load(in_ptr1 + (x0), xmask)
    tmp2 = tmp0 + tmp1
    tl.store(out_ptr0 + (x0), tmp2, xmask)
```

Pull Request resolved: #110740
Approved by: https://github.com/jansel