
[CPU] Add concat-linear fusion pass for da8w4 #2476


Closed · 14 commits

Conversation

@Xia-Weiwen Xia-Weiwen commented Jul 2, 2025

Summary
This PR adds a concat-linear fusion pass for da8w4 on CPU. The pass fuses the following pattern

    da8w4_linear_cpu(x, ..., w1, ...) -- y1
  /
x -- da8w4_linear_cpu(x, ..., w2, ...) -- y2
  \ ...
    da8w4_linear_cpu(x, ..., wN, ...) -- yN

to

x -- da8w4_linear_cpu(x, ..., w_concat, ...) -- y_concat -- split -- (y1, y2, ..., yN)
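The rewrite is numerically exact: concatenating the weight matrices along the output dimension and splitting the fused output along that same dimension reproduces each yi. A minimal plain-Python sketch of the idea (toy float matrices, no quantization, no torch; `matmul` is a naive helper written here for illustration, not part of the PR):

```python
# Sketch: N separate linears vs. one concat-linear followed by a split.

def matmul(x, w):
    # x: (m, k) rows, w: (k, n) columns -- naive matrix multiply.
    return [[sum(x[i][t] * w[t][j] for t in range(len(w)))
             for j in range(len(w[0]))] for i in range(len(x))]

x = [[1.0, 2.0]]                # one input row, k = 2
w1 = [[1.0], [0.0]]             # (2, 1)
w2 = [[0.0, 1.0], [1.0, 0.0]]   # (2, 2)

# Separate linears (the pattern before fusion).
y1, y2 = matmul(x, w1), matmul(x, w2)

# Concatenated weight: (2, 3); one matmul, then split along columns.
w_cat = [r1 + r2 for r1, r2 in zip(w1, w2)]
y_cat = matmul(x, w_cat)
y1_split = [row[:1] for row in y_cat]
y2_split = [row[1:] for row in y_cat]

assert y1_split == y1 and y2_split == y2
```

The performance win comes from replacing N small GEMMs over the same activation with one larger GEMM, not from any numerical change.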

The fusion pass is registered as a custom `post_grad` pass in Inductor. It takes effect only when `torch._inductor.config.cpp.enable_concat_linear` is `True`.
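From the user's side, opting in looks roughly like the following config fragment (a hedged sketch: the flag name is taken from this description, while `model` and `x` are hypothetical placeholders for a da8w4-quantized module and its input):

```python
import torch

# Opt in to the CPU concat-linear fusion described in this PR.
torch._inductor.config.cpp.enable_concat_linear = True

# `model` and `x` are placeholders; the pass fires during Inductor's
# post-grad graph passes when the compiled model runs on CPU.
compiled = torch.compile(model)
y = compiled(x)
```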

Benchmarks show that the total CPU time spent in linear layers is reduced by more than 5% with concat-linear fusion when running Llama3.1-8B with 32 cores on a 6th-gen Intel(R) Xeon(R) processor.

Test plan

pytest test/quantization/test_da8w4_cpu.py -k test_8da4w_concat_linear_cpu


pytorch-bot bot commented Jul 2, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/2476

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 1 Unrelated Failure

As of commit e125d05 with merge base 64c1ce3:

NEW FAILURE - The following job has failed:

BROKEN TRUNK - The following job failed but was already failing on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jul 2, 2025
@Xia-Weiwen Xia-Weiwen added the topic: not user facing Use this tag if you don't want this PR to show up in release notes label Jul 2, 2025
@Xia-Weiwen
Collaborator Author

Hi @CaoE Could you please also review this PR? I cannot add you as a reviewer.

@Xia-Weiwen Xia-Weiwen added the cpu label Jul 2, 2025
@Xia-Weiwen Xia-Weiwen requested a review from jerryzh168 July 4, 2025 09:27
@Xia-Weiwen
Collaborator Author

Hi @jerryzh168 Could you please review this PR? This PR adds new Inductor passes and we would like to hear your suggestions on where to put the code. Thanks.

    def register_da8w4_concat_linear_cpu_pass():
        from torch._inductor import config as inductor_config

        inductor_config.post_grad_custom_post_pass = _concat_linear_dq8w4_cpu
Collaborator

> I found that we can get `gm` via `graph.owning_module`, so we can use `post_grad_custom_post_pass` to apply the pass. Then we don't need to use the `register_backend_for_device` API. Thanks.

But this one may silently cause a conflict, right?

Collaborator Author

Yes, exactly. We need to extend the passes to lists in PyTorch.


Collaborator

> Yes, exactly. We need to extend the passes to lists in PyTorch.

Still, it feels like extending the current design to avoid this conflict would be a better solution. Let's at least add a note about the potential conflict.
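Besides adding a note, a conservative interim workaround (until Inductor accepts a list of passes) would be to chain the new pass with whatever is already registered instead of overwriting the slot. A sketch with stub passes (pure Python; `chain_passes` and the pass names are hypothetical illustrations, not the PR's code):

```python
def chain_passes(*passes):
    """Compose graph passes so registering a new one doesn't drop the old."""
    def combined(graph):
        for p in passes:
            if p is not None:   # skip an empty slot
                p(graph)
    return combined

# Stand-ins for real FX graph passes; each records that it ran.
calls = []

def existing_pass(graph):
    calls.append("existing")

def concat_linear_pass(graph):
    calls.append("concat_linear")

# Instead of overwriting the single post-pass slot, wrap both.
post_grad_custom_post_pass = chain_passes(existing_pass, concat_linear_pass)
post_grad_custom_post_pass(graph=None)

assert calls == ["existing", "concat_linear"]
```

The ordering of the chained passes still matters, which is why first-class list support in Inductor remains the cleaner fix.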

Collaborator Author

Hi @jansel @eellison I think we need to extend the custom passes in Inductor (to make them lists). I will probably submit a PR for it. The custom passes added by pytorch/pytorch#154841 do not meet our needs. Do you have comments? Thanks.

Contributor

List registration makes sense to me.


Collaborator Author

Thanks

@jerryzh168
Contributor

LGTM, @Xia-Weiwen I had a message in slack, wondering if you want to migrate the cpu related stuff to experimental folder to be more consistent with the rest of the CPU kernels: https://github.com/pytorch/ao/tree/main/torchao/experimental

@Xia-Weiwen
Collaborator Author

> LGTM, @Xia-Weiwen I had a message in slack, wondering if you want to migrate the cpu related stuff to experimental folder to be more consistent with the rest of the CPU kernels: https://github.com/pytorch/ao/tree/main/torchao/experimental

Thanks for reviewing and sorry for the late reply.

@Xia-Weiwen Xia-Weiwen requested a review from jerryzh168 July 8, 2025 03:42
@Xia-Weiwen
Collaborator Author

@pytorchbot merge


pytorch-bot bot commented Jul 8, 2025

This PR has pending changes requested. Please address the comments and update the PR before merging.

@Xia-Weiwen
Collaborator Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 jobs have failed, first few of them are: Run TorchAO Experimental Tests / test-mps-ops (macos-m1-stable)


@Xia-Weiwen
Collaborator Author

@pytorchbot merge -f "CI failures are unrelated"

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.


Labels: CLA Signed · cpu · Merged · topic: not user facing
7 participants