[Inductor] addmm + activation function fusion #158137

AaronWang04 · 2025-07-11T18:50:54Z

This PR implements a pass in post_grad to fuse activation(add + mm).

Something similar was done in #106912, but it was reverted for performance reasons. It was replaced with a pass that unfuses the activation and add from addmm/addmm_activation and lets Inductor handle the fusion.

Since then, however, the cuBLAS team has made a lot of perf improvements here. I will update this post with more benchmarks, but preliminary benchmarks show good results.

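Perf dashboard (screenshot from 2025-08-07): https://github.com/user-attachments/assets/d44d6205-b33a-4a20-9f0f-d9db176b3738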

ReLU works in both training and inference, but GELU only works in inference mode. This is a fundamental limitation: GELU's derivative depends on the input, while ReLU's does not. I don't think this is fixable with the current addmm_activation API.
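To make the ReLU/GELU asymmetry concrete, here is a minimal pure-Python sketch (not code from this PR). It assumes the tanh form of GELU, and the helper names (`relu_grad_from_output`, `find_second_preimage`) are made up for illustration. ReLU's gradient can be rebuilt from the fused output alone, whereas two different pre-activation values can give the same GELU output with gradients of opposite sign, so the output by itself is not enough for the backward pass.

```python
import math

SQRT_2_OVER_PI = math.sqrt(2.0 / math.pi)


def relu(x):
    return max(x, 0.0)


def gelu_tanh(x):
    # tanh approximation of GELU (the variant discussed in this PR)
    return 0.5 * x * (1.0 + math.tanh(SQRT_2_OVER_PI * (x + 0.044715 * x**3)))


def numeric_grad(f, x, eps=1e-6):
    # central finite difference, used only to inspect the slope
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)


def relu_grad_from_output(y):
    # ReLU's gradient is recoverable from the output: 1 where y > 0, else 0
    return 1.0 if y > 0.0 else 0.0


def find_second_preimage(target, lo=-3.0, hi=-0.8, steps=200):
    # gelu_tanh is decreasing on [lo, hi], so bisection finds the x in that
    # interval whose output equals `target`
    for _ in range(steps):
        mid = 0.5 * (lo + hi)
        if gelu_tanh(mid) > target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


x1 = -0.5
y = gelu_tanh(x1)
x2 = find_second_preimage(y)  # a different input with the same GELU output

grad_x1 = numeric_grad(gelu_tanh, x1)  # positive slope
grad_x2 = numeric_grad(gelu_tanh, x2)  # negative slope
```

This matches the ReLU graph in this description, where the backward mask is computed as `le(out, 0)` directly on the fused output.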

Graph module before and after this pass

Relu(addmm)

graph():
    %primals_1 : [num_users=1] = placeholder[target=primals_1]
    %primals_2 : [num_users=2] = placeholder[target=primals_2]
    %primals_3 : [num_users=2] = placeholder[target=primals_3]
    %addmm : [num_users=1] = call_function[target=torch.ops.aten.addmm.default](args = (%primals_1, %primals_3, %primals_2), kwargs = {})
    %relu : [num_users=2] = call_function[target=torch.ops.aten.relu.default](args = (%addmm,), kwargs = {})
    %le : [num_users=1] = call_function[target=torch.ops.aten.le.Scalar](args = (%relu, 0), kwargs = {})
    %permute_1 : [num_users=1] = call_function[target=torch.ops.aten.permute.default](args = (%primals_3, [1, 0]), kwargs = {})
    return (relu, primals_2, le, permute_1)
graph():
    %primals_1 : [num_users=1] = placeholder[target=primals_1]
    %primals_2 : [num_users=2] = placeholder[target=primals_2]
    %primals_3 : [num_users=2] = placeholder[target=primals_3]
    %_addmm_activation_default : [num_users=2] = call_function[target=torch.ops.aten._addmm_activation.default](args = (%primals_1, %primals_3, %primals_2), kwargs = {})
    %le : [num_users=1] = call_function[target=torch.ops.aten.le.Scalar](args = (%_addmm_activation_default, 0), kwargs = {})
    %permute_1 : [num_users=1] = call_function[target=torch.ops.aten.permute.default](args = (%primals_3, [1, 0]), kwargs = {})
    return (_addmm_activation_default, primals_2, le, permute_1)

Gelu (addmm)

graph():
    %arg0_1 : [num_users=1] = placeholder[target=arg0_1]
    %arg1_1 : [num_users=1] = placeholder[target=arg1_1]
    %arg2_1 : [num_users=1] = placeholder[target=arg2_1]
    %addmm : [num_users=4] = call_function[target=torch.ops.aten.addmm.default](args = (%arg0_1, %arg2_1, %arg1_1), kwargs = {})
    %mul : [num_users=1] = call_function[target=torch.ops.aten.mul.Tensor](args = (%addmm, %addmm), kwargs = {})
    %mul_1 : [num_users=1] = call_function[target=torch.ops.aten.mul.Tensor](args = (%mul, %addmm), kwargs = {})
    %mul_2 : [num_users=1] = call_function[target=torch.ops.aten.mul.Tensor](args = (%mul_1, 0.044715), kwargs = {})
    %add : [num_users=1] = call_function[target=torch.ops.aten.add.Tensor](args = (%addmm, %mul_2), kwargs = {})
    %mul_3 : [num_users=1] = call_function[target=torch.ops.aten.mul.Tensor](args = (%add, 0.7978845608028654), kwargs = {})
    %mul_4 : [num_users=1] = call_function[target=torch.ops.aten.mul.Tensor](args = (%addmm, 0.5), kwargs = {})
    %tanh : [num_users=1] = call_function[target=torch.ops.aten.tanh.default](args = (%mul_3,), kwargs = {})
    %add_1 : [num_users=1] = call_function[target=torch.ops.aten.add.Tensor](args = (%tanh, 1), kwargs = {})
    %mul_5 : [num_users=1] = call_function[target=torch.ops.aten.mul.Tensor](args = (%mul_4, %add_1), kwargs = {})
    return (mul_5,)
graph():
    %arg0_1 : [num_users=1] = placeholder[target=arg0_1]
    %arg1_1 : [num_users=1] = placeholder[target=arg1_1]
    %arg2_1 : [num_users=1] = placeholder[target=arg2_1]
    %_addmm_activation_default : [num_users=1] = call_function[target=torch.ops.aten._addmm_activation.default](args = (%arg0_1, %arg2_1, %arg1_1), kwargs = {use_gelu: True})
    return (_addmm_activation_default,)
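As a sanity check on what the pass collapses, the sketch below replays the arithmetic of the "before" GELU graph on a scalar, reusing the node names from the dump, and compares it with the closed-form tanh GELU. It is a pure-Python illustration rather than the Inductor pattern itself; the function names are chosen here for readability.

```python
import math


def gelu_tanh_decomposed(addmm):
    # Same sequence of ops as the decomposed graph, applied to one scalar
    mul = addmm * addmm
    mul_1 = mul * addmm                # addmm ** 3
    mul_2 = mul_1 * 0.044715
    add = addmm + mul_2
    mul_3 = add * 0.7978845608028654   # sqrt(2 / pi)
    mul_4 = addmm * 0.5
    tanh = math.tanh(mul_3)
    add_1 = tanh + 1
    mul_5 = mul_4 * add_1
    return mul_5


def gelu_tanh_closed_form(x):
    # 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x**3)
    return 0.5 * x * (1.0 + math.tanh(inner))


samples = [-3.0, -1.0, -0.25, 0.0, 0.5, 2.0]
max_err = max(abs(gelu_tanh_decomposed(v) - gelu_tanh_closed_form(v)) for v in samples)
```

The ten elementwise nodes are exactly the tanh approximation of GELU, which is why a single `_addmm_activation(..., use_gelu=True)` call can stand in for them.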

Benchmark setup:
NGC pytorch 25.06 container
cublas version: 12.9.1.4
torch.compile ran with dynamic = False and max_autotune

H100

Testing with M=1024, N=1024, K=1024, dtype=bfloat16
============================================================
Average Time per Iteration (cublas):	 0.0107 ms
Average Time per Iteration (torch compile):	 0.0296 ms

============================================================
Testing with M=2048, N=2048, K=2048, dtype=bfloat16
============================================================
Average Time per Iteration (cublas):	 0.0262 ms
Average Time per Iteration (torch compile):	 0.0327 ms

============================================================
Testing with M=4096, N=4096, K=4096, dtype=bfloat16
============================================================
Average Time per Iteration (cublas):	 0.1763 ms
Average Time per Iteration (torch compile):	 0.2457 ms

============================================================
Testing with M=8192, N=8192, K=8192, dtype=bfloat16
============================================================
Average Time per Iteration (cublas):	 1.5280 ms
Average Time per Iteration (torch compile):	 1.9437 ms

A100

############################################################
Testing with dtype: float16
############################################################

============================================================
Testing with M=1024, N=1024, K=1024, dtype=float16
============================================================
Average Time per Iteration (cublas):	 0.0313 ms
Average Time per Iteration (torch compile):	 0.0643 ms

============================================================
Testing with M=2048, N=2048, K=2048, dtype=float16
============================================================
Average Time per Iteration (cublas):	 0.1149 ms
Average Time per Iteration (torch compile):	 0.1255 ms

============================================================
Testing with M=4096, N=4096, K=4096, dtype=float16
============================================================
Average Time per Iteration (cublas):	 0.6297 ms
Average Time per Iteration (torch compile):	 0.7547 ms

============================================================
Testing with M=8192, N=8192, K=8192, dtype=float16
============================================================
Average Time per Iteration (cublas):	 4.3821 ms
Average Time per Iteration (torch compile):	 5.0740 ms
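For easier reading, the timings above can be turned into ratios. The snippet below is only a convenience sketch: the numbers are copied from the two result blocks, and the dictionary layout and the `speedup` helper are not part of the benchmark script.

```python
# (cublas_ms, torch_compile_ms) per square size, copied from the results above
results_ms = {
    ("H100", "bfloat16"): {
        1024: (0.0107, 0.0296),
        2048: (0.0262, 0.0327),
        4096: (0.1763, 0.2457),
        8192: (1.5280, 1.9437),
    },
    ("A100", "float16"): {
        1024: (0.0313, 0.0643),
        2048: (0.1149, 0.1255),
        4096: (0.6297, 0.7547),
        8192: (4.3821, 5.0740),
    },
}


def speedup(cublas_ms, compiled_ms):
    # > 1.0 means the fused cuBLAS call is faster than the compiled Triton path
    return compiled_ms / cublas_ms


speedups = {
    key: {size: speedup(*pair) for size, pair in by_size.items()}
    for key, by_size in results_ms.items()
}

for (gpu, dtype), by_size in speedups.items():
    for size, ratio in by_size.items():
        print(f"{gpu} {dtype} M=N=K={size}: {ratio:.2f}x")
```

In these runs the fused call is ahead at every listed size, with the largest gap at 1024 on both GPUs.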

Script

import torch
torch.manual_seed(0)

warmup, numrun = 10, 100

sizes = [1024, 2048, 4096, 8192]
dtypes = [torch.float16, torch.bfloat16, torch.float32]

device = torch.device("cuda")

for dtype in dtypes:
    dtype_name = str(dtype).split('.')[-1] 
    print(f"\n{'#'*60}")
    print(f"Testing with dtype: {dtype_name}")
    print(f"{'#'*60}")
    
    for size in sizes:
        M, N, K = size, size, size
        print(f"\n{'='*60}")
        print(f"Testing with M={M}, N={N}, K={K}, dtype={dtype_name}")
        print(f"{'='*60}")
        
        A = torch.randn(M, K, device=device, dtype=dtype)
        B = torch.randn(K, N, device=device, dtype=dtype)
        # Bias has one entry per output column (N); M == N here, so timings are unaffected.
        C = torch.randn(N, device=device, dtype=dtype)

        def func1():
            return torch._addmm_activation(C, A, B, use_gelu=True)

        def func2():
            return torch.nn.functional.gelu(torch.add(C, torch.mm(A, B)), approximate="tanh")

        func2_compiled = torch.compile(
            func2,
            dynamic=False, 
            options={
                "force_disable_caches": True,
                "max_autotune": True,
                "max_autotune_gemm": True,
                "max_autotune_gemm_backends": "TRITON",
                "autotune_fallback_to_aten": False,
            }
        )

        for _ in range(warmup): func1()
        torch.cuda.synchronize(device=device)

        start_event = torch.cuda.Event(enable_timing=True)
        end_event = torch.cuda.Event(enable_timing=True)

        total_time_ms = 0.0
        start_event.record()
        for _ in range(numrun): func1()
        end_event.record()
        torch.cuda.synchronize(device=device)
        total_time_ms += start_event.elapsed_time(end_event)
        avg_time_ms = total_time_ms / numrun

        print(f"Average Time per Iteration (cublas):\t {avg_time_ms:.4f} ms")

        for _ in range(warmup): func2_compiled()
        torch.cuda.synchronize(device=device)

        start_event = torch.cuda.Event(enable_timing=True)
        end_event = torch.cuda.Event(enable_timing=True)

        total_time_ms = 0.0
        start_event.record()
        for _ in range(numrun): func2_compiled()
        end_event.record()
        torch.cuda.synchronize(device=device)
        total_time_ms += start_event.elapsed_time(end_event)
        avg_time_ms = total_time_ms / numrun

        print(f"Average Time per Iteration (torch compile):\t {avg_time_ms:.4f} ms")

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @Lucaskabela

pytorch-bot · 2025-07-11T18:50:58Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/158137

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki

Note: Links to docs will display an error until the docs builds have been completed.

❌ 38 New Failures, 1 Unrelated Failure

As of commit 1974ada with merge base a9fabeb:

NEW FAILURES - The following jobs have failed:

inductor-periodic / cuda12.8-py3.10-gcc9-sm86 / test (aot_inductor_huggingface, 1, 1, linux.g5.4xlarge.nvidia.gpu) (gh)
DistillGPT2
inductor-periodic / cuda12.8-py3.10-gcc9-sm86 / test (dynamic_inductor_torchbench, 1, 2, linux.g5.4xlarge.nvidia.gpu) (gh)
hf_BigBird
inductor-periodic / cuda12.8-py3.10-gcc9-sm86-periodic-dynamo-benchmarks / test (aot_eager_timm, 2, 2, linux.g5.4xlarge.nvidia.gpu) (gh)
mobilenetv3_large_100
inductor-periodic / cuda12.8-py3.10-gcc9-sm86-periodic-dynamo-benchmarks / test (dynamic_aot_eager_timm, 2, 2, linux.g5.4xlarge.nvidia.gpu) (gh)
mobilenetv3_large_100
inductor-periodic / cuda12.8-py3.10-gcc9-sm86-periodic-dynamo-benchmarks / test (dynamic_aot_eager_torchbench, 1, 2, linux.g5.4xlarge.nvidia.gpu) (gh)
hf_BigBird
inductor-periodic / linux-jammy-cpu-py3.9-gcc11-inductor / test (cpu_aot_inductor_freezing_huggingface, 1, 1, linux.8xlarge.amx) (gh)
DistillGPT2
inductor-periodic / linux-jammy-cpu-py3.9-gcc11-inductor / test (cpu_inductor_amp_freezing_timm, 1, 2, linux.8xlarge.amx) (gh)
dla102
inductor-periodic / linux-jammy-cpu-py3.9-gcc11-inductor / test (cpu_inductor_amp_freezing_torchbench, 2, 2, linux.8xlarge.amx) (gh)
vision_maskrcnn
inductor-periodic / linux-jammy-cpu-py3.9-gcc11-inductor / test (cpu_inductor_freezing_timm, 1, 2, linux.8xlarge.amx) (gh)
dla102
inductor-periodic / linux-jammy-cpu-py3.9-gcc11-inductor / test (dynamic_cpu_aot_inductor_freezing_torchbench, 1, 2, linux.8xlarge.amx) (gh)
basic_gnn_sage
inductor-periodic / linux-jammy-cpu-py3.9-gcc11-periodic-dynamo-benchmarks / test (cpu_inductor_freezing_avx2_timm, 1, 2, linux.10xlarge.avx2) (gh)
dla102
pull / linux-docs / build-docs-cpp-false (gh)
Final attempt failed. Child_process exited with error code 1
pull / linux-docs / build-docs-functorch-false (gh)
Final attempt failed. Child_process exited with error code 1
pull / linux-docs / build-docs-python-false (gh)
Final attempt failed. Child_process exited with error code 1
pull / linux-jammy-cuda12.8-cudnn9-py3.10-clang12 / build (gh)
pull / linux-jammy-py3.10-clang12 / test (crossref, 1, 2, lf.linux.2xlarge) (gh)
Final attempt failed. Child_process exited with error code 1
pull / linux-jammy-py3.10-clang12 / test (crossref, 2, 2, lf.linux.2xlarge) (gh)
Final attempt failed. Child_process exited with error code 1
pull / linux-jammy-py3.10-clang12 / test (default, 1, 5, lf.linux.4xlarge) (gh)
Final attempt failed. Child_process exited with error code 1
pull / linux-jammy-py3.10-clang12 / test (default, 2, 5, lf.linux.4xlarge) (gh)
Final attempt failed. Child_process exited with error code 1
pull / linux-jammy-py3.10-clang12 / test (default, 3, 5, lf.linux.4xlarge) (gh)
Final attempt failed. Child_process exited with error code 1
pull / linux-jammy-py3.10-clang12 / test (default, 4, 5, lf.linux.4xlarge) (gh)
Final attempt failed. Child_process exited with error code 1
pull / linux-jammy-py3.10-clang12 / test (default, 5, 5, lf.linux.4xlarge) (gh)
Final attempt failed. Child_process exited with error code 1
pull / linux-jammy-py3.10-clang12 / test (dynamo_wrapped, 1, 3, lf.linux.2xlarge) (gh)
Final attempt failed. Child_process exited with error code 1
pull / linux-jammy-py3.10-clang12 / test (dynamo_wrapped, 2, 3, lf.linux.2xlarge) (gh)
Final attempt failed. Child_process exited with error code 1
pull / linux-jammy-py3.10-clang12 / test (dynamo_wrapped, 3, 3, lf.linux.2xlarge) (gh)
Final attempt failed. Child_process exited with error code 1
pull / linux-jammy-py3.10-clang12 / test (einops, 1, 1, lf.linux.2xlarge) (gh)
Final attempt failed. Child_process exited with error code 1
pull / linux-jammy-py3.10-gcc11 / test (backwards_compat, 1, 1, lf.linux.2xlarge) (gh)
Final attempt failed. Child_process exited with error code 1
pull / linux-jammy-py3.10-gcc11 / test (default, 1, 5, lf.linux.2xlarge) (gh)
Final attempt failed. Child_process exited with error code 1
pull / linux-jammy-py3.10-gcc11 / test (default, 2, 5, lf.linux.2xlarge) (gh)
Final attempt failed. Child_process exited with error code 1
pull / linux-jammy-py3.10-gcc11 / test (default, 3, 5, lf.linux.2xlarge) (gh)
Final attempt failed. Child_process exited with error code 1
pull / linux-jammy-py3.10-gcc11 / test (default, 4, 5, lf.linux.2xlarge) (gh)
Final attempt failed. Child_process exited with error code 1
pull / linux-jammy-py3.10-gcc11 / test (default, 5, 5, lf.linux.2xlarge) (gh)
Final attempt failed. Child_process exited with error code 1
pull / linux-jammy-py3.10-gcc11 / test (distributed, 1, 2, lf.linux.2xlarge) (gh)
Final attempt failed. Child_process exited with error code 1
pull / linux-jammy-py3.10-gcc11 / test (distributed, 2, 2, lf.linux.2xlarge) (gh)
Final attempt failed. Child_process exited with error code 1
pull / linux-jammy-py3.10-gcc11 / test (docs_test, 1, 1, lf.linux.2xlarge) (gh)
Final attempt failed. Child_process exited with error code 1
pull / linux-jammy-py3.10-gcc11 / test (jit_legacy, 1, 1, lf.linux.2xlarge) (gh)
Final attempt failed. Child_process exited with error code 1
pull / linux-jammy-py3.10-gcc11 / test (numpy_2_x, 1, 1, lf.linux.2xlarge) (gh)
Final attempt failed. Child_process exited with error code 1
torchbench / cuda12.8-py3.10-gcc9-sm80 / test (torchbench_gcp_smoketest, 1, 1, linux.aws.a100) (gh)
Process completed with exit code 1.

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

inductor-periodic / cuda12.8-py3.10-gcc9-sm80 / test (inductor_torchbench_smoketest_perf, 1, 1, linux.aws.a100, unstable) (gh) (#161295)
Process completed with exit code 1.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

AaronWang04 · 2025-07-11T21:24:48Z

@pytorchbot label "topic: not user facing"

eqy · 2025-07-11T23:58:38Z

needs some benchmarks comparing against existing Triton fusions

eellison

Thanks for PR. A couple comments.

eellison · 2025-07-16T10:21:40Z

torch/_inductor/fx_passes/post_grad.py

+def addmm_gelu_pattern(input, mat1, mat2):
+    output = aten.mm(mat1, mat2)
+    output = aten.add()
+    return aten.gelu(output)


Is it also worth adding a pattern that targets addmm, instead of mm? I imagine most of the addmms will not be decomposed.

The addmms that are followed by an activation will be decomposed, and those are the ones we are interested in pattern matching.

torch/_inductor/fx_passes/post_grad.py

eellison · 2025-07-22T15:18:27Z

@AaronWang04 re-request when ready..

AaronWang04 · 2025-07-22T17:30:28Z

@pytorchbot rebase

pytorchmergebot · 2025-07-22T17:32:05Z

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

pytorchmergebot · 2025-07-22T17:32:10Z

Successfully rebased addmm_activation_fusion onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout addmm_activation_fusion && git pull --rebase)

eellison

Did you have a chance to run any dashboards? Just curious. Anyway, this looks good, but it would be nice if you could use the gen_register_replacement API to avoid compile-time overhead. It matters more for training patterns, but it's still nice either way.

eellison · 2025-07-25T14:33:38Z

torch/_inductor/fx_passes/post_grad.py

+    args_bf16 = [torch.empty(shape, dtype=torch.bfloat16) for shape in shapes]
+
+    for pattern in [addmm_relu_pattern, addmm_relu_pattern_2]:
+        register_replacement(


Now that we are parameterizing this across 4 total patterns, could I trouble you to pre-register the patterns? See gen_register_replacement:

## Precompiled Patterns

New patterns are added using register_replacement(). Patterns added in this way can have a compile-time overhead because they need to be traced before use. Patterns can be precompiled and added using gen_register_replacement() instead. To do this you call gen_register_replacement() instead of register_replacement(). The arguments are the same except for an additional unique name which is used as a lookup key.

And https://github.com/pytorch/pytorch/blob/main/torchgen/fuse/gen_patterns.py.

eellison · 2025-07-25T14:35:27Z

torch/_inductor/fx_passes/post_grad.py

+def addmm_gelu_pattern_2(input, mat1, mat2):
+    output = aten.mm(mat1, mat2)
+    output = aten.add(input, output)
+    return aten.gelu(output, approximate="tanh")


It's kind of unfortunate that cuBLAS only supports "tanh", which we don't use by default in PyTorch.
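For context on why the two variants cannot be swapped silently: the default GELU is the exact erf form, and the tanh form is a close but distinct approximation. The sketch below compares them on a grid in plain Python; the grid range and function names are arbitrary choices for illustration.

```python
import math


def gelu_erf(x):
    # exact GELU: x * Phi(x), the form used when no approximation is requested
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x):
    # tanh approximation, the only GELU variant the fused epilogue covers here
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x**3)
    return 0.5 * x * (1.0 + math.tanh(inner))


xs = [i / 100.0 for i in range(-600, 601)]  # -6.00 ... 6.00
max_abs_diff = max(abs(gelu_erf(x) - gelu_tanh(x)) for x in xs)
print(f"max |gelu_erf - gelu_tanh| on [-6, 6]: {max_abs_diff:.2e}")
```

The gap is small but nonzero, so a pattern written for `approximate="tanh"` would presumably leave models that use the default GELU unfused rather than change their numerics.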

eqy · 2025-07-25T18:29:53Z

@eellison For my edification, what's the dashboard, is it ciflow/inductor-perf-compare?

eellison · 2025-07-25T19:23:51Z

@eqy trigger a run on https://github.com/pytorch/pytorch/actions/workflows/inductor-perf-test-nightly.yml or https://github.com/pytorch/pytorch/actions/workflows/inductor-perf-test-nightly-h100.yml.

Results will show up at https://hud.pytorch.org/benchmark/compilers

AaronWang04 · 2025-07-28T18:01:52Z

@eellison the dashboard run finished; results are under the branch "AaronWang04_addmmfusion_perftest".

I'm not sure where the slowdowns come from: either some addmm_activation shapes are slower than Triton, or the benchmarks have high variance.

test/inductor/test_pattern_matcher.py

AaronWang04 · 2025-08-06T17:54:15Z

@eellison I ran the dashboard again with gen_register_replacement.

I pasted a screenshot of it in the top comment.

pytorchmergebot · 2025-08-20T18:18:55Z

Successfully rebased addmm_activation_fusion onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout addmm_activation_fusion && git pull --rebase)

Pull Request resolved: pytorch#158137
Approved by: https://github.com/eellison

This reverts commit b9d7de3. Reverted pytorch#158137 on behalf of https://github.com/malfet due to Broke inductor torchbench, see https://hud.pytorch.org/hud/pytorch/pytorch/663da17b622d0c56f288b32b31276166dbed761e/1?per_page=50&name_filter=inductor_torchbench%2C%202%2C%202 ([comment](pytorch#158137 (comment)))

eqy · 2025-08-22T20:31:59Z

@pytorchmergebot label ciflow/inductor ciflow/torchbench ciflow/inductor-periodic

eqy · 2025-08-28T23:30:34Z

@pytorchmergebot label ciflow/inductor ciflow/torchbench ciflow/inductor-periodic

PR implements a pass in post_grad to fuse activation(add + mm)

This was previously done in a similar way in pytorch#106912, but that change was reverted for performance reasons. It was replaced with a pass that unfuses the activation and add from addmm/addmm_activation and lets inductor handle the fusion.

Since then the cuBLAS team has made a lot of perf improvements here. I will update this post with more benchmarks, but preliminary benchmarks show good results.

perf dashboard
<img width="3371" height="1240" alt="Screenshot from 2025-08-07 13-41-35" src="https://github.com/user-attachments/assets/d44d6205-b33a-4a20-9f0f-d9db176b3738" />

Relu works with both training and inference, but gelu only works in inference mode: gelu's derivative depends on the pre-activation input, whereas relu's can be computed from the output alone. I don't think this is fixable with the current addmm_activation API.

Graph module before and after this pass

Relu(addmm)
```
graph():
    %primals_1 : [num_users=1] = placeholder[target=primals_1]
    %primals_2 : [num_users=2] = placeholder[target=primals_2]
    %primals_3 : [num_users=2] = placeholder[target=primals_3]
    %addmm : [num_users=1] = call_function[target=torch.ops.aten.addmm.default](args = (%primals_1, %primals_3, %primals_2), kwargs = {})
    %relu : [num_users=2] = call_function[target=torch.ops.aten.relu.default](args = (%addmm,), kwargs = {})
    %le : [num_users=1] = call_function[target=torch.ops.aten.le.Scalar](args = (%relu, 0), kwargs = {})
    %permute_1 : [num_users=1] = call_function[target=torch.ops.aten.permute.default](args = (%primals_3, [1, 0]), kwargs = {})
    return (relu, primals_2, le, permute_1)
graph():
    %primals_1 : [num_users=1] = placeholder[target=primals_1]
    %primals_2 : [num_users=2] = placeholder[target=primals_2]
    %primals_3 : [num_users=2] = placeholder[target=primals_3]
    %_addmm_activation_default : [num_users=2] = call_function[target=torch.ops.aten._addmm_activation.default](args = (%primals_1, %primals_3, %primals_2), kwargs = {})
    %le : [num_users=1] = call_function[target=torch.ops.aten.le.Scalar](args = (%_addmm_activation_default, 0), kwargs = {})
    %permute_1 : [num_users=1] = call_function[target=torch.ops.aten.permute.default](args = (%primals_3, [1, 0]), kwargs = {})
    return (_addmm_activation_default, primals_2, le, permute_1)
```

Gelu (addmm)
```
graph():
    %arg0_1 : [num_users=1] = placeholder[target=arg0_1]
    %arg1_1 : [num_users=1] = placeholder[target=arg1_1]
    %arg2_1 : [num_users=1] = placeholder[target=arg2_1]
    %addmm : [num_users=4] = call_function[target=torch.ops.aten.addmm.default](args = (%arg0_1, %arg2_1, %arg1_1), kwargs = {})
    %mul : [num_users=1] = call_function[target=torch.ops.aten.mul.Tensor](args = (%addmm, %addmm), kwargs = {})
    %mul_1 : [num_users=1] = call_function[target=torch.ops.aten.mul.Tensor](args = (%mul, %addmm), kwargs = {})
    %mul_2 : [num_users=1] = call_function[target=torch.ops.aten.mul.Tensor](args = (%mul_1, 0.044715), kwargs = {})
    %add : [num_users=1] = call_function[target=torch.ops.aten.add.Tensor](args = (%addmm, %mul_2), kwargs = {})
    %mul_3 : [num_users=1] = call_function[target=torch.ops.aten.mul.Tensor](args = (%add, 0.7978845608028654), kwargs = {})
    %mul_4 : [num_users=1] = call_function[target=torch.ops.aten.mul.Tensor](args = (%addmm, 0.5), kwargs = {})
    %tanh : [num_users=1] = call_function[target=torch.ops.aten.tanh.default](args = (%mul_3,), kwargs = {})
    %add_1 : [num_users=1] = call_function[target=torch.ops.aten.add.Tensor](args = (%tanh, 1), kwargs = {})
    %mul_5 : [num_users=1] = call_function[target=torch.ops.aten.mul.Tensor](args = (%mul_4, %add_1), kwargs = {})
    return (mul_5,)
graph():
    %arg0_1 : [num_users=1] = placeholder[target=arg0_1]
    %arg1_1 : [num_users=1] = placeholder[target=arg1_1]
    %arg2_1 : [num_users=1] = placeholder[target=arg2_1]
    %_addmm_activation_default : [num_users=1] = call_function[target=torch.ops.aten._addmm_activation.default](args = (%arg0_1, %arg2_1, %arg1_1), kwargs = {use_gelu: True})
    return (_addmm_activation_default,)
```

Benchmark setup:
NGC pytorch 25.06 container
cublas version: 12.9.1.4
torch.compile ran with dynamic = False and max_autotune

H100
```
Testing with M=1024, N=1024, K=1024, dtype=bfloat16
============================================================
Average Time per Iteration (cublas): 0.0107 ms
Average Time per Iteration (torch compile): 0.0296 ms

============================================================
Testing with M=2048, N=2048, K=2048, dtype=bfloat16
============================================================
Average Time per Iteration (cublas): 0.0262 ms
Average Time per Iteration (torch compile): 0.0327 ms

============================================================
Testing with M=4096, N=4096, K=4096, dtype=bfloat16
============================================================
Average Time per Iteration (cublas): 0.1763 ms
Average Time per Iteration (torch compile): 0.2457 ms

============================================================
Testing with M=8192, N=8192, K=8192, dtype=bfloat16
============================================================
Average Time per Iteration (cublas): 1.5280 ms
Average Time per Iteration (torch compile): 1.9437 ms
```

A100
```
############################################################
Testing with dtype: float16
############################################################

============================================================
Testing with M=1024, N=1024, K=1024, dtype=float16
============================================================
Average Time per Iteration (cublas): 0.0313 ms
Average Time per Iteration (torch compile): 0.0643 ms

============================================================
Testing with M=2048, N=2048, K=2048, dtype=float16
============================================================
Average Time per Iteration (cublas): 0.1149 ms
Average Time per Iteration (torch compile): 0.1255 ms

============================================================
Testing with M=4096, N=4096, K=4096, dtype=float16
============================================================
Average Time per Iteration (cublas): 0.6297 ms
Average Time per Iteration (torch compile): 0.7547 ms

============================================================
Testing with M=8192, N=8192, K=8192, dtype=float16
============================================================
Average Time per Iteration (cublas): 4.3821 ms
Average Time per Iteration (torch compile): 5.0740 ms
```

Script
```py
import torch

torch.manual_seed(0)

warmup, numrun = 10, 100

sizes = [1024, 2048, 4096, 8192]
dtypes = [torch.float16, torch.bfloat16, torch.float32]

device = torch.device("cuda")

for dtype in dtypes:
    dtype_name = str(dtype).split('.')[-1]
    print(f"\n{'#'*60}")
    print(f"Testing with dtype: {dtype_name}")
    print(f"{'#'*60}")

    for size in sizes:
        M, N, K = size, size, size
        print(f"\n{'='*60}")
        print(f"Testing with M={M}, N={N}, K={K}, dtype={dtype_name}")
        print(f"{'='*60}")

        A = torch.randn(M, K, device=device, dtype=dtype)
        B = torch.randn(K, N, device=device, dtype=dtype)
        # 1-D bias; it broadcasts against the (M, N) product because M == N here
        C = torch.randn(M, device=device, dtype=dtype)

        # fused path: a single _addmm_activation call
        def func1():
            return torch._addmm_activation(C, A, B, use_gelu=True)

        # unfused path: mm + add + tanh-gelu, left to torch.compile to fuse
        def func2():
            return torch.nn.functional.gelu(torch.add(C, torch.mm(A, B)), approximate="tanh")

        func2_compiled = torch.compile(
            func2,
            dynamic=False,
            options={
                "force_disable_caches": True,
                "max_autotune": True,
                "max_autotune_gemm": True,
                "max_autotune_gemm_backends": "TRITON",
                "autotune_fallback_to_aten": False,
            }
        )

        # time the fused path
        for _ in range(warmup):
            func1()
        torch.cuda.synchronize(device=device)

        start_event = torch.cuda.Event(enable_timing=True)
        end_event = torch.cuda.Event(enable_timing=True)

        total_time_ms = 0.0
        start_event.record()
        for _ in range(numrun):
            func1()
        end_event.record()
        torch.cuda.synchronize(device=device)
        total_time_ms += start_event.elapsed_time(end_event)
        avg_time_ms = total_time_ms / numrun

        print(f"Average Time per Iteration (cublas):\t {avg_time_ms:.4f} ms")

        # time the compiled unfused path
        for _ in range(warmup):
            func2_compiled()
        torch.cuda.synchronize(device=device)

        start_event = torch.cuda.Event(enable_timing=True)
        end_event = torch.cuda.Event(enable_timing=True)

        total_time_ms = 0.0
        start_event.record()
        for _ in range(numrun):
            func2_compiled()
        end_event.record()
        torch.cuda.synchronize(device=device)
        total_time_ms += start_event.elapsed_time(end_event)
        avg_time_ms = total_time_ms / numrun

        print(f"Average Time per Iteration (torch compile):\t {avg_time_ms:.4f} ms")
```

Pull Request resolved: pytorch#158137
Approved by: https://github.com/eellison
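For a quick read of the benchmark numbers quoted in the commit message, the ratios can be worked out with a few lines of Python. This is only arithmetic over the timings as reported; the dictionary and function names are made up for the illustration, and "cublas" refers to the `torch._addmm_activation` path as labelled in the output.

```python
# Timings in ms copied from the reported output: (cublas, torch compile).
h100_bf16 = {
    1024: (0.0107, 0.0296),
    2048: (0.0262, 0.0327),
    4096: (0.1763, 0.2457),
    8192: (1.5280, 1.9437),
}
a100_fp16 = {
    1024: (0.0313, 0.0643),
    2048: (0.1149, 0.1255),
    4096: (0.6297, 0.7547),
    8192: (4.3821, 5.0740),
}


def speedups(timings):
    """Return {size: torch_compile_time / cublas_time} for each problem size."""
    return {size: compiled / cublas for size, (cublas, compiled) in timings.items()}


h100_speedup = speedups(h100_bf16)
a100_speedup = speedups(a100_fp16)

for name, table in [("H100 bf16", h100_speedup), ("A100 fp16", a100_speedup)]:
    for size, ratio in table.items():
        print(f"{name} M=N=K={size}: compiled time / cublas time = {ratio:.2f}")
```

In these runs the ratio is largest at the smallest size and stays above 1 for every reported size on both GPUs.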

This reverts commit b9d7de3. Reverted pytorch#158137 on behalf of https://github.com/malfet because it broke inductor torchbench; see https://hud.pytorch.org/hud/pytorch/pytorch/663da17b622d0c56f288b32b31276166dbed761e/1?per_page=50&name_filter=inductor_torchbench%2C%202%2C%202 (comment on pytorch#158137)

nikitaved · 2025-09-22T13:18:30Z

torch/_inductor/fx_passes/post_grad.py

+    # do not fuse if there are pointwise ops after
+    return not all(is_pointwise_use(use) for use in output.users)


I guess we should not fuse as long as there is a pointwise op as per the comment, right? The current logic does not fuse only when pointwise ops are the only consumers.
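To make the difference concrete, here is a small stand-alone sketch. It assumes the quoted expression's return value means "go ahead and fuse", and it models each user of the output as a boolean (`True` meaning `is_pointwise_use(use)` would be true). The two function names are hypothetical and exist only for this comparison.

```python
def should_fuse_not_all(users_pointwise):
    """Mirrors the quoted line: fuse unless every user is pointwise."""
    return not all(users_pointwise)


def should_fuse_not_any(users_pointwise):
    """What the code comment describes: fuse only if no user is pointwise."""
    return not any(users_pointwise)


cases = {
    "all pointwise": [True, True],
    "mixed": [True, False],
    "none pointwise": [False, False],
    "no users": [],
}

for name, flags in cases.items():
    print(f"{name:15s} not all -> {should_fuse_not_all(flags)!s:5s} "
          f"not any -> {should_fuse_not_any(flags)}")
```

The two variants disagree on the mixed case and on the empty case, since `all([])` is `True` and `any([])` is `False`.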

nikitaved · 2025-09-22T13:54:35Z

torch/_inductor/fx_passes/post_grad.py

+def addmm_gelu_pattern(input, mat1, mat2):
+    output = aten.mm(mat1, mat2)
+    output = aten.add(output, input)
+    return aten.gelu(output, approximate="tanh")


I am definitely missing some context here. But could you please explain why there is no addmm pattern? Are add and mm parts always decoupled?

Perhaps it gets decomposed into add and mm anyway? I forget the details here

Yeah, I registered this in pass_patterns[2] and moved the addmm unfuse pass to pass_patterns[1].

Ah, ok, so there is an intention here. Maybe dropping a comment about pass_pattern could be helpful.
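A toy model of why the ordering matters, under the assumption that the passes run in index order and that the unfuse pass rewrites addmm into mm followed by add. Graphs are plain lists of op names here and both pass functions are hypothetical stand-ins, not the inductor pattern matcher.

```python
def unfuse_addmm(ops):
    """Toy stand-in for the unfuse pass: addmm -> mm, add."""
    out = []
    for op in ops:
        out.extend(["mm", "add"] if op == "addmm" else [op])
    return out


def fuse_activation(ops):
    """Toy stand-in for the fusion pass: mm, add, gelu -> _addmm_activation."""
    out, i = [], 0
    while i < len(ops):
        if ops[i:i + 3] == ["mm", "add", "gelu"]:
            out.append("_addmm_activation")
            i += 3
        else:
            out.append(ops[i])
            i += 1
    return out


graph = ["addmm", "gelu"]

# Unfuse first, then match the mm + add + gelu pattern.
unfuse_then_fuse = fuse_activation(unfuse_addmm(graph))

# Opposite order: the pattern never sees mm + add, so nothing fuses.
fuse_then_unfuse = unfuse_addmm(fuse_activation(graph))

print(unfuse_then_fuse)
print(fuse_then_unfuse)
```

With the unfuse step first, a pattern written against mm + add is enough, which is consistent with there being no separate addmm pattern.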

shunting314 · 2025-09-22T18:39:48Z

torch/_inductor/fx_passes/post_grad.py

+    args_bf16 = [torch.empty(shape, dtype=torch.bfloat16) for shape in shapes]
+
+    for pattern in [addmm_relu_pattern, addmm_relu_pattern_2]:
+        name = f"{pattern.__name__}_fp32"


Any specific reason that there is not a registration for fp16 for the relu case?

The FX graphs for bf16 and fp16 look exactly the same (there's an upcast to fp32 for the intermediates), whereas the fp32 graph is different because it has no upcasts.

Oh, sorry, I meant bf16, since I saw you have a bf16-specific registration for gelu but not for relu.

Ah ok, same logic there: gelu needs the upcast but relu does not.

nikitaved · 2025-09-23T11:06:26Z

torch/_inductor/fx_passes/post_grad.py

+def register_addmm_activation_fusion():
+    shapes = [(5,), (3, 4), (4, 5)]
+    args_fp32 = [torch.empty(shape) for shape in shapes]


Just for my personal understanding. Regarding these shapes -- are they arbitrary?

Yes, it really doesn't matter what these shapes are; they are only there so the function can be run to generate the dynamo graph.
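The shapes do still have to be mutually consistent so that the pattern function can be traced at all. A minimal pure-Python check of that, assuming the three entries are (input, mat1, mat2) in the same order as the pattern's arguments; `mm_shape` and `broadcast_shape` are hypothetical helpers written for this sketch, not PyTorch functions.

```python
from itertools import zip_longest


def mm_shape(a, b):
    """Result shape of a 2-D matrix product, or ValueError on mismatch."""
    if len(a) != 2 or len(b) != 2 or a[1] != b[0]:
        raise ValueError(f"cannot multiply {a} by {b}")
    return (a[0], b[1])


def broadcast_shape(a, b):
    """NumPy/PyTorch-style broadcast of two shapes, aligned from the right."""
    result = []
    for x, y in zip_longest(reversed(a), reversed(b), fillvalue=1):
        if x != y and x != 1 and y != 1:
            raise ValueError(f"cannot broadcast {a} with {b}")
        result.append(max(x, y))
    return tuple(reversed(result))


shapes = [(5,), (3, 4), (4, 5)]
bias, mat1, mat2 = shapes

product = mm_shape(mat1, mat2)          # (3, 4) @ (4, 5)
out = broadcast_shape(product, bias)    # add the 1-D bias to every row
print(product, out)
```

Any other triple that satisfies the same two constraints would serve equally well for tracing.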

nikitaved · 2025-09-23T11:58:36Z

torch/_inductor/fx_passes/post_grad.py

+            name = f"{pattern.__name__}_{dtype_suffix}"
+            gen_register_replacement(
+                name,


There must be a way to tell that the substitute is only valid in the inference mode, otherwise, as I see it, the backward will not work since _addmm_activation does not even have a grad_fn function registered for it.

Yeah, it's traced with fwd_only.

The backward will not work with _addmm_activation (though I think cuBLAS supports it now; it's just that the native function hasn't been updated to reflect this).
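The relu/gelu asymmetry from the PR description can be illustrated numerically in plain Python. The sketch below uses the tanh approximation with the constants that appear in the unfused graph. The helper names, the sample points, and the bisection bracket are choices made for this illustration only. It shows that relu's gradient mask can be rebuilt from the output, while two different inputs can give the same gelu output with different derivatives.

```python
import math

C0 = 0.7978845608028654  # constant from the unfused graph, roughly sqrt(2/pi)
C1 = 0.044715


def relu(x):
    return max(x, 0.0)


def gelu_tanh(x):
    return 0.5 * x * (1.0 + math.tanh(C0 * (x + C1 * x**3)))


def gelu_tanh_grad(x):
    """Analytic derivative of gelu_tanh with respect to x."""
    t = math.tanh(C0 * (x + C1 * x**3))
    du = C0 * (1.0 + 3.0 * C1 * x * x)
    return 0.5 * (1.0 + t) + 0.5 * x * (1.0 - t * t) * du


# ReLU: the mask (x > 0) is recoverable from the output alone.
xs = [-2.0, -0.5, 0.0, 0.5, 2.0]
relu_mask_from_input = [x > 0 for x in xs]
relu_mask_from_output = [relu(x) > 0 for x in xs]

# GELU: bisect for a second input that produces the same output as x1.
x1 = -0.5
target = gelu_tanh(x1)
lo, hi = -3.0, -0.8  # gelu_tanh(lo) > target > gelu_tanh(hi) on this bracket
for _ in range(200):
    mid = 0.5 * (lo + hi)
    if gelu_tanh(mid) > target:
        lo = mid
    else:
        hi = mid
x2 = 0.5 * (lo + hi)

print(f"gelu({x1}) = {gelu_tanh(x1):.6f}, grad = {gelu_tanh_grad(x1):+.4f}")
print(f"gelu({x2:.4f}) = {gelu_tanh(x2):.6f}, grad = {gelu_tanh_grad(x2):+.4f}")
```

Because the saved output alone does not pin down the gelu derivative, a backward pass needs the pre-activation value, which matches the `le` on the fused output being sufficient in the relu training graph.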

pytorch-bot bot added the module: inductor label Jul 11, 2025

pytorchbot added the open source label Jul 11, 2025

pytorch-bot bot added the topic: not user facing topic category label Jul 11, 2025

eqy requested a review from eellison July 11, 2025 23:58

eqy added the matrix multiplication label Jul 11, 2025

AaronWang04 marked this pull request as ready for review July 12, 2025 00:04

mikaylagawarecki requested a review from shunting314 July 14, 2025 15:03

mikaylagawarecki added the triaged label Jul 14, 2025

eellison reviewed Jul 16, 2025

torch/_inductor/fx_passes/post_grad.py

eellison self-requested a review July 17, 2025 21:58

pytorch-bot bot added ciflow/inductor and removed ciflow/inductor labels Jul 18, 2025

eellison removed their request for review July 22, 2025 15:18

pytorchmergebot force-pushed the addmm_activation_fusion branch from bbb1083 to 0f9acc5 Compare July 22, 2025 17:32

AaronWang04 requested a review from eellison July 23, 2025 21:06

eellison previously approved these changes Jul 25, 2025

etaf previously requested changes Aug 6, 2025

test/inductor/test_pattern_matcher.py (outdated)

eellison previously approved these changes Aug 8, 2025

AaronWang04 added 4 commits August 20, 2025 18:18

update test

ad877d4

idrk

2d76f99

test

82d6d55

test

77ff3db

pytorchmergebot force-pushed the addmm_activation_fusion branch from ff89911 to 77ff3db Compare August 20, 2025 18:18

pytorch-bot bot removed ciflow/inductor ciflow/torchbench ciflow/inductor-periodic labels Aug 20, 2025

pytorch-bot bot added ciflow/inductor ciflow/inductor-periodic ciflow/torchbench labels Aug 22, 2025

change tolerance

1974ada

pytorch-bot bot removed ciflow/inductor ciflow/torchbench ciflow/inductor-periodic labels Aug 28, 2025

pytorch-bot bot added ciflow/inductor ciflow/inductor-periodic ciflow/torchbench labels Aug 28, 2025

nikitaved reviewed Sep 22, 2025

shunting314 reviewed Sep 22, 2025

nikitaved reviewed Sep 23, 2025

pytorch-bot bot commented Jul 11, 2025 • edited

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/158137

❌ 38 New Failures, 1 Unrelated Failure
