Enable _int_mm on Intel GPU #157769
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/157769
Note: Links to docs will display an error until the docs builds have been completed.
✅ You can merge normally! (2 unrelated failures) As of commit e511d78 with merge base 255a04b:
FLAKY - The following job failed but was likely due to flakiness present on trunk.
UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@liangan1 please help review this PR.
Attention! native_functions.yaml was changed. If you are adding a new function or defaulted argument to native_functions.yaml, you cannot use it from pre-existing Python frontend code until our FC window passes (two weeks). Split your PR into two PRs: one that adds the new C++ functionality, and one that makes use of it from Python, and land them two weeks apart. See https://github.com/pytorch/pytorch/wiki/PyTorch's-Python-Frontend-Backward-and-Forward-Compatibility-Policy#forwards-compatibility-fc for more info.
Attention! One of the PyTorch C-stable API files was changed. You MUST NOT change existing function declarations in it, as this header defines a stable C ABI. If you need to change the signature of a function, introduce a new v2 version of the function and modify code generation to target the new version.
I assume you should enable
pytorch/test/inductor/test_aot_inductor.py
Lines 6023 to 6038 in 3ee8828
@skipIfXpu(
    msg="The operator 'aten::_int_mm' is not currently implemented for the XPU device"
)
def test__int_mm(self):
    class Model(torch.nn.Module):
        def __init__(self) -> None:
            super().__init__()

        def forward(self, x, y):
            return torch._int_mm(x, y)

    example_inputs = (
        torch.randint(-10, 10, (64, 32), device=self.device, dtype=torch.int8),
        torch.randint(-10, 10, (32, 64), device=self.device, dtype=torch.int8),
    )
    self.check_model(Model(), example_inputs)
and
pytorch/test/inductor/test_select_algorithm.py
Lines 148 to 159 in 3ee8828
@patches
@skipIfXpu(msg="XPU has not supported _int_mm yet")
def test__int_mm(self):
    @torch.compile
    def foo(a, b):
        return torch._int_mm(a, b)

    foo(
        torch.randint(-10, 10, (64, 32), device=GPU_TYPE, dtype=torch.int8),
        torch.randint(-10, 10, (32, 64), device=GPU_TYPE, dtype=torch.int8),
    )
    self.assertEqual(counters["inductor"]["select_algorithm_autotune"], 1)
Tensor bias = at::Tensor();
Tensor mat2_scales = at::ones({1}, mat2.options().dtype(at::kDouble));
Tensor mat2_zero_points = at::zeros({1}, mat2.options().dtype(at::kInt));
@ZhiweiYan-96 I found mat2_zero_points is an unused parameter in quantized_matmul, right? Could you help check it?
Yes, weight_zp != 0 has no usage currently. This is because the pt2e operators are specifically designed for the nn.Linear module, and the weight should have no zero point.
By the way, @xiaowangintel you may set m2_zp = std::nullopt to save one copy kernel and one kernel launch here; this should be helpful for low-precision inference.
Do you mean Tensor mat2_zero_points = at::Tensor(); is enough?
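For context, a minimal Python sketch of why the weight zero point is unused under the symmetric int8 scheme described above (my own illustration, not code from this PR): with symmetric per-tensor quantization the zero point is fixed at 0, so dequantization is just q * scale.

import torch

# Symmetric per-tensor int8 quantization of a weight: zero_point is 0 by
# construction, so dequant(q) = q * scale and any weight_zp input is a no-op.
w = torch.randn(32, 32)
scale = w.abs().max() / 127
q = torch.clamp((w / scale).round(), -128, 127).to(torch.int8)
w_hat = q.to(torch.float32) * scale  # no zero_point term needed
print((w - w_hat).abs().max())       # small quantization error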
Tensor bias = at::Tensor();
Tensor mat2_scales = at::ones({1}, mat2.options().dtype(at::kDouble));
I noticed that mat2_scales will be converted to kFloat inside quantized_matmul, so why do we need to change its dtype here?
fixed
@parametrize("use_transpose_a", [True, False]) | ||
@parametrize("use_transpose_b", [True, False]) | ||
@parametrize("non_contig_type", [0, 1, 2]) | ||
def test__int_mm( |
Overall LGTM.
The only nit: instead of adding a new unit test, could we share the existing one with another backend? Such as
Line 7448 in 1b58e7a
def test__int_mm_cpu(self, device, m, k, n, use_transpose_a, use_transpose_b, non_contig_type):
and
pytorch/test/test_out_dtype_op.py
Lines 187 to 203 in 1b58e7a
def test_out_dtype_inductor_decomp_trace(self) -> None:
    def func(x, w):
        return out_dtype(torch.ops.aten.mm.default, torch.int32, x, w)

    w = torch.randint(-128, 127, (32, 32), dtype=torch.int8, device="cuda")
    x = torch.randint(-128, 127, (32, 32), dtype=torch.int8, device="cuda")

    # Check that make_fx with inductor decomps produces _int_mm
    decomp_table = torch._inductor.decomposition.select_decomp_table()
    gm = make_fx(func, decomp_table, tracing_mode="symbolic")(x, w)
    self.assertExpectedInline(gm.code.strip(), """\
def forward(self, x_1, w_1):
    _int_mm = torch.ops.aten._int_mm.default(x_1, w_1);  x_1 = w_1 = None
    return _int_mm""")

@unittest.skipIf(not TEST_CUDA, "cuda only")
def test_out_dtype_int_mm_default_trace(self) -> None:
I mean you could enable only these three UTs for Intel GPU, not all the UTs in the file.
@guangyey Per my understanding, to enable these 3 UTs we would need to enable the XPU device for all UTs and add a lot of skips for the other UTs, which may involve a lot of code changes. I prefer to keep the same logic as the other UTs, which means we can first enable the XPU device for _int_mm in the ported test_linear.py in torch-xpu-op by @deng, and she will upstream these UTs to the public test_linalg.py in the near future.
@liangan1, @xiaowangintel, do you know why adding XPU results in many failed cases? I'm trying to understand the feature gap.
By the way, we can extract the 3 UTs into another test class and then enable XPU for this newly added test class.
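A rough sketch of that "separate test class" idea (my own illustration; the class name, the reference check, and the allow_xpu flag are assumptions, not part of this PR), using PyTorch's device-generic test machinery:

import torch
from torch.testing._internal.common_utils import TestCase, run_tests
from torch.testing._internal.common_device_type import instantiate_device_type_tests


class TestIntMM(TestCase):
    def test__int_mm_basic(self, device):
        a = torch.randint(-10, 10, (64, 32), device=device, dtype=torch.int8)
        b = torch.randint(-10, 10, (32, 64), device=device, dtype=torch.int8)
        out = torch._int_mm(a, b)
        # Reference computed on CPU in int32 to stay independent of the backend.
        ref = torch.mm(a.cpu().to(torch.int32), b.cpu().to(torch.int32))
        self.assertEqual(out.cpu(), ref)


# Assumes the allow_xpu flag available in recent common_device_type helpers.
instantiate_device_type_tests(TestIntMM, globals(), allow_xpu=True)

if __name__ == "__main__":
    run_tests()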
To add the ciflow label: this helps ensure we don't trigger CI on this PR until it is actually authorized to do so. Please ping one of the reviewers if you do not have access to approve and run workflows.
AOTI_TORCH_EXPORT AOTITorchError aoti_torch_xpu__addmm_activation(AtenTensorHandle self, AtenTensorHandle mat1, AtenTensorHandle mat2, double beta, double alpha, int32_t use_gelu, AtenTensorHandle* ret0);
AOTI_TORCH_EXPORT AOTITorchError aoti_torch_xpu__fused_moving_avg_obs_fq_helper_functional(AtenTensorHandle self, AtenTensorHandle observer_on, AtenTensorHandle fake_quant_on, AtenTensorHandle running_min, AtenTensorHandle running_max, AtenTensorHandle scale, AtenTensorHandle zero_point, double averaging_const, int64_t quant_min, int64_t quant_max, int64_t ch_axis, int32_t per_row_fake_quant, int32_t symmetric_quant, AtenTensorHandle* ret0, AtenTensorHandle* ret1, AtenTensorHandle* ret2, AtenTensorHandle* ret3, AtenTensorHandle* ret4, AtenTensorHandle* ret5);
AOTI_TORCH_EXPORT AOTITorchError aoti_torch_xpu__int_mm_out(AtenTensorHandle out, AtenTensorHandle self, AtenTensorHandle mat2);
@etaf please help review the change here
Is this change auto-generated by torchgen? Expected to be yes.
Yes
cc @desertfire to confirm whether we are OK with this one or not.
Had a discussion with @albanD offline. Adding an XPU entry is OK here, but we also want to make it clear that the XPU backend is experimental, so we don't make any BC promises for it. Thus I plan to create an "_experimental" directory under torch/csrc/inductor/aoti_torch/generated/ and move c_shim_xpu.h under there.
TORCH_CHECK(
    self.dtype() == at::kChar,
    ": Expected self dtype to be of type int8 but got ",
": Expected self dtype to be of type int8 but got ", | |
"Expected self dtype to be of type int8 but got ", |
    self.dtype());
TORCH_CHECK(
    mat2.dtype() == at::kChar,
    ": Expected mat2 dtype to be of type int8 but got ",
": Expected mat2 dtype to be of type int8 but got ", | |
"Expected mat2 dtype to be of type int8 but got ", |
TORCH_CHECK(result.is_contiguous(), "Expected result to be contiguous.");

if (result.numel() == 0 || self.size(1) == 0) {
Why does this check not check self.numel()? What's the meaning of self.size(1) == 0?
@xiaowangintel, could you please elaborate on self.size(1) == 0 and why not self.numel() == 0?
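For what it's worth, my reading of the check (an assumption, not confirmed in this thread): result.numel() == 0 covers m == 0 or n == 0, while self.size(1) == 0 covers k == 0, where the output is non-empty but there is nothing to reduce over, so the GEMM must be skipped and the result zero-filled. A small CPU illustration:

import torch

m, k, n = 64, 0, 64
a = torch.randint(-10, 10, (m, k), dtype=torch.int8)
b = torch.randint(-10, 10, (k, n), dtype=torch.int8)
# k == 0: the inputs are empty but the (m, n) result is not.
ref = a.to(torch.int32) @ b.to(torch.int32)
print(a.numel(), ref.numel())   # 0 4096
print(int(ref.abs().max()))     # 0 -> the result must be all zeros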
Tensor mat2_scales = at::ones({1}, mat2.options().dtype(at::kFloat));
Tensor mat2_zero_points = at::zeros({1}, mat2.options().dtype(at::kInt));
@ZhiweiYan-96, does oneDNN support the case where both the scale and zero_point are scalars? I'm kind of concerned about the integration overhead.
The scale and zp need to be wrapped as XPU tensors to be used as the attribute inputs; scalars are not supported yet.
oneDNN currently cannot accept a host scalar directly, so the H2D copy here always exists. We can remove the H2D copy after the fix is merged in oneDNN.
@parametrize("use_transpose_a", [True, False]) | ||
@parametrize("use_transpose_b", [True, False]) | ||
@parametrize("non_contig_type", [0, 1, 2]) | ||
def test__int_mm( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@liangan1 , @xiaowangintel , do you know why adding XPU results in many failed cases? I'm trying to understand the feature gap.
By the way, we can extract the 3 UTs as another test class and then enable XPU for this newly added test class.
@xiaowangintel, please help resolve conflicts.
Force-pushed from cf25f1c to e511d78 (compare).
@etaf, is the failed case a new failure?
Yes, I'll disable it first.
@albanD, @desertfire, may I know if you have any comments on this PR?
@pytorchbot merge
Merge started: Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
@pytorchbot merge
Merge started: Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
@pytorchbot merge
Merge started: Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Motivation
This PR enables _int_mm on Intel GPU. _int_mm is used by int8 quantization in torchao.
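A minimal usage sketch of the newly enabled op (assuming a PyTorch build with XPU support and an Intel GPU visible as device "xpu"): _int_mm multiplies two int8 matrices and accumulates the result in int32, which is the primitive that torchao's int8 quantization path relies on.

import torch

a = torch.randint(-128, 127, (64, 32), dtype=torch.int8, device="xpu")
b = torch.randint(-128, 127, (32, 64), dtype=torch.int8, device="xpu")
out = torch._int_mm(a, b)       # int8 x int8 matmul with int32 accumulation
print(out.dtype, out.shape)     # torch.int32 torch.Size([64, 64])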
Model Test Result:
We ran meta-llama/Llama-3.1-8B-Instruct on Intel GPU and A100 using torchao int8 dynamic quantization. The model configuration is as follows:
Precision: torch.bfloat16
Quantization configuration: Int8DynamicActivationInt8WeightConfig
Dataset: wikitext
Result:
The perplexity values for Intel GPU and A100 are 9.582953453063965 and 9.57755184173584, respectively.
cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @jerryzh168 @voznesenskym @penguinwu @EikanWang @Guobing-Chen @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben