
Conversation

@xiaowangintel (Contributor) commented Jul 8, 2025

Motivation

This PR enables _int_mm on Intel GPU. _int_mm is used by int8 quantization in torchao.
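
For context, a minimal sketch of what the newly enabled op computes (assuming an available XPU device; shapes and values are illustrative):

```python
import torch

# int8 x int8 -> int32 matrix multiply, now available on Intel GPU ("xpu").
a = torch.randint(-10, 10, (64, 32), device="xpu", dtype=torch.int8)
b = torch.randint(-10, 10, (32, 64), device="xpu", dtype=torch.int8)
c = torch._int_mm(a, b)

# The result accumulates in int32 and matches an int32 reference matmul on CPU.
ref = a.cpu().to(torch.int32) @ b.cpu().to(torch.int32)
assert c.dtype == torch.int32
assert torch.equal(c.cpu(), ref)
```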

Model Test Result:

We ran meta-llama/Llama-3.1-8B-Instruct on Intel GPU and A100 using torchao int8 dynamic quantization. The model configuration is as follows:
Precision: torch.bfloat16
Quantization configuration: Int8DynamicActivationInt8WeightConfig
Dataset: wikitext

Result:
The perplexity values for Intel GPU and A100 are 9.582953453063965 and 9.57755184173584, respectively.
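
For reference, a rough sketch of the quantization setup described above (hypothetical harness; it assumes the torchao quantize_ API and a Hugging Face model load, and omits the wikitext perplexity evaluation):

```python
import torch
from torchao.quantization import quantize_, Int8DynamicActivationInt8WeightConfig
from transformers import AutoModelForCausalLM

# Load the model in bfloat16 and place it on the Intel GPU ("xpu") or on CUDA.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct", torch_dtype=torch.bfloat16
).to("xpu")

# Apply int8 dynamic-activation / int8-weight quantization in place; the
# quantized linear layers dispatch their matmuls through torch._int_mm.
quantize_(model, Int8DynamicActivationInt8WeightConfig())
```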

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @jerryzh168 @voznesenskym @penguinwu @EikanWang @Guobing-Chen @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben

pytorch-bot bot commented Jul 8, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/157769

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (2 Unrelated Failures)

As of commit e511d78 with merge base 255a04b:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added module: cpu CPU specific problem (e.g., perf, algorithm) release notes: inductor (aoti) labels Jul 8, 2025
@xiaowangintel xiaowangintel changed the title Enable _int_mm on Intel GPU [WIP] Enable _int_mm on Intel GPU Jul 8, 2025
@xiaowangintel (Contributor, Author)

@liangan1 please help review this PR.

@guangyey guangyey added ciflow/xpu Run XPU CI tasks release notes: xpu release notes category labels Jul 8, 2025
@guangyey guangyey moved this to Pre-Review Required in PyTorch Intel Jul 8, 2025
github-actions bot commented Jul 8, 2025

Attention! native_functions.yaml was changed

If you are adding a new function or defaulted argument to native_functions.yaml, you cannot use it from pre-existing Python frontend code until our FC window passes (two weeks). Split your PR into two PRs, one which adds the new C++ functionality, and one that makes use of it from Python, and land them two weeks apart. See https://github.com/pytorch/pytorch/wiki/PyTorch's-Python-Frontend-Backward-and-Forward-Compatibility-Policy#forwards-compatibility-fc for more info.


Caused by:

github-actions bot commented Jul 8, 2025

Attention! One of PyTorch's C-stable API files was changed

You MUST NOT change existing function declarations in this file, as this header defines a stable C ABI. If you need to change the signature of a function, introduce a new v2 version of the function and modify code generation to target the new version of the function.


Caused by:

@guangyey (Collaborator) left a comment

I assume you should enable

```python
@skipIfXpu(
    msg="The operator 'aten::_int_mm' is not currently implemented for the XPU device"
)
def test__int_mm(self):
    class Model(torch.nn.Module):
        def __init__(self) -> None:
            super().__init__()

        def forward(self, x, y):
            return torch._int_mm(x, y)

    example_inputs = (
        torch.randint(-10, 10, (64, 32), device=self.device, dtype=torch.int8),
        torch.randint(-10, 10, (32, 64), device=self.device, dtype=torch.int8),
    )
    self.check_model(Model(), example_inputs)
```

and
```python
@patches
@skipIfXpu(msg="XPU has not supported _int_mm yet")
def test__int_mm(self):
    @torch.compile
    def foo(a, b):
        return torch._int_mm(a, b)

    foo(
        torch.randint(-10, 10, (64, 32), device=GPU_TYPE, dtype=torch.int8),
        torch.randint(-10, 10, (32, 64), device=GPU_TYPE, dtype=torch.int8),
    )
    self.assertEqual(counters["inductor"]["select_algorithm_autotune"], 1)
```


```cpp
  Tensor bias = at::Tensor();
  Tensor mat2_scales = at::ones({1}, mat2.options().dtype(at::kDouble));
  Tensor mat2_zero_points = at::zeros({1}, mat2.options().dtype(at::kInt));
```
Collaborator

@ZhiweiYan-96 I found that mat2_zero_points is an unused parameter in quantized_matmul, right? Could you help check it?

Collaborator

Yes, weight_zp != 0 is not used currently. This is because the pt2e operators are specifically designed for the nn.Linear module, whose weight should have no zero point.
By the way, @xiaowangintel, you may set m2_zp = std::nullopt to save one copy kernel and one kernel launch here, which should be helpful for low-precision inference.
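
A rough Python analogue of the point being made (hypothetical, not the actual C++ integration; it assumes an XPU device):

```python
import torch

# Today: the scale and zero point are materialized as 1-element device tensors,
# which costs an allocation plus a host-to-device copy each.
mat2_scales = torch.ones(1, dtype=torch.float32, device="xpu")
mat2_zero_points = torch.zeros(1, dtype=torch.int32, device="xpu")

# Suggested change: pass no zero point at all (std::nullopt on the C++ side),
# saving one copy kernel and one kernel launch per call.
mat2_zero_points = None
```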

Collaborator

Do you mean Tensor mat2_zero_points = at::Tensor(); is enough?

```cpp
  }

  Tensor bias = at::Tensor();
  Tensor mat2_scales = at::ones({1}, mat2.options().dtype(at::kDouble));
```
Collaborator

I noticed that mat2_scales will be converted to kFloat inside quantized_matmul, so why do we need to change its dtype here?

Contributor Author

fixed

@pytorch-bot pytorch-bot bot added module: inductor and removed ciflow/xpu Run XPU CI tasks labels Jul 8, 2025
@guangyey guangyey changed the title [WIP] Enable _int_mm on Intel GPU Enable _int_mm on Intel GPU Jul 8, 2025
```python
    @parametrize("use_transpose_a", [True, False])
    @parametrize("use_transpose_b", [True, False])
    @parametrize("non_contig_type", [0, 1, 2])
    def test__int_mm(
```
@guangyey (Collaborator) commented Jul 8, 2025
Overall LGTM.
The only nit: instead of adding a new unit test, could we share the existing one with another backend? For example:

```python
def test__int_mm_cpu(self, device, m, k, n, use_transpose_a, use_transpose_b, non_contig_type):
```

and
```python
def test_out_dtype_inductor_decomp_trace(self) -> None:
    def func(x, w):
        return out_dtype(torch.ops.aten.mm.default, torch.int32, x, w)

    w = torch.randint(-128, 127, (32, 32), dtype=torch.int8, device="cuda")
    x = torch.randint(-128, 127, (32, 32), dtype=torch.int8, device="cuda")

    # Check that make_fx with inductor decomps produces _int_mm
    decomp_table = torch._inductor.decomposition.select_decomp_table()
    gm = make_fx(func, decomp_table, tracing_mode="symbolic")(x, w)
    self.assertExpectedInline(gm.code.strip(), """\
def forward(self, x_1, w_1):
    _int_mm = torch.ops.aten._int_mm.default(x_1, w_1); x_1 = w_1 = None
    return _int_mm""")

@unittest.skipIf(not TEST_CUDA, "cuda only")
def test_out_dtype_int_mm_default_trace(self) -> None:
```

Contributor Author

The prerequisite for enabling this test is to add the XPU device to the test file. However, for other operators a large number of tests would then fail. In addition, @deng, Daisy will enable this test script on Intel GPU, so she will add this test to that script.

Collaborator

I mean you could enable only these three UTs for Intel GPU, not all the UTs in the file.

Contributor

@guangyey Per my understanding, to enable these 3 UTs we would need to enable the XPU device for all UTs and add a lot of skips for the other UTs, which may involve a lot of code changes. I prefer to keep the same logic as the other UTs, which means that we can first enable the XPU device for _int_mm in the ported test_linear.py in torch-xpu-op by @deng, and she will upstream these UTs to the public test_linalg.py in the near future.

Collaborator

@liangan1 , @xiaowangintel , do you know why adding XPU results in many failed cases? I'm trying to understand the feature gap.
By the way, we can extract the 3 UTs as another test class and then enable XPU for this newly added test class.
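
For illustration, a sketch of that suggestion (hypothetical test layout; it assumes the instantiate_device_type_tests helper with only_for/allow_xpu, and the exact flags needed for XPU enablement may differ):

```python
import torch
from torch.testing._internal.common_device_type import instantiate_device_type_tests
from torch.testing._internal.common_utils import TestCase, run_tests


class TestIntMm(TestCase):
    # Only the _int_mm UTs live in this class, so XPU can be enabled here
    # without touching the device coverage of the rest of test_linalg.py.
    def test__int_mm_smoke(self, device):
        a = torch.randint(-10, 10, (64, 32), device=device, dtype=torch.int8)
        b = torch.randint(-10, 10, (32, 64), device=device, dtype=torch.int8)
        ref = a.cpu().to(torch.int32) @ b.cpu().to(torch.int32)
        self.assertEqual(torch._int_mm(a, b).cpu(), ref)


# Instantiate the new class for CPU and XPU only; other classes stay unchanged.
instantiate_device_type_tests(TestIntMm, globals(), only_for=["cpu", "xpu"], allow_xpu=True)

if __name__ == "__main__":
    run_tests()
```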

@guangyey guangyey added the ciflow/xpu Run XPU CI tasks label Jul 8, 2025
pytorch-bot bot commented Jul 8, 2025

To add the ciflow label ciflow/xpu please first approve the workflows that are awaiting approval (scroll to the bottom of this page).

This helps ensure we don't trigger CI on this PR until it is actually authorized to do so. Please ping one of the reviewers if you do not have access to approve and run workflows.

@pytorch-bot pytorch-bot bot removed the ciflow/xpu Run XPU CI tasks label Jul 8, 2025
@guangyey guangyey added the ciflow/xpu Run XPU CI tasks label Jul 8, 2025
@jerryzh168 jerryzh168 added the triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module label Jul 8, 2025

```cpp
AOTI_TORCH_EXPORT AOTITorchError aoti_torch_xpu__addmm_activation(AtenTensorHandle self, AtenTensorHandle mat1, AtenTensorHandle mat2, double beta, double alpha, int32_t use_gelu, AtenTensorHandle* ret0);
AOTI_TORCH_EXPORT AOTITorchError aoti_torch_xpu__fused_moving_avg_obs_fq_helper_functional(AtenTensorHandle self, AtenTensorHandle observer_on, AtenTensorHandle fake_quant_on, AtenTensorHandle running_min, AtenTensorHandle running_max, AtenTensorHandle scale, AtenTensorHandle zero_point, double averaging_const, int64_t quant_min, int64_t quant_max, int64_t ch_axis, int32_t per_row_fake_quant, int32_t symmetric_quant, AtenTensorHandle* ret0, AtenTensorHandle* ret1, AtenTensorHandle* ret2, AtenTensorHandle* ret3, AtenTensorHandle* ret4, AtenTensorHandle* ret5);
AOTI_TORCH_EXPORT AOTITorchError aoti_torch_xpu__int_mm_out(AtenTensorHandle out, AtenTensorHandle self, AtenTensorHandle mat2);
```
Collaborator

@etaf please help review the change here

Collaborator

Is this change auto-generated by torchgen? Expected to be yes.

Contributor Author

Yes

Collaborator

cc @desertfire to confirm if we are ok with this one or not?

@desertfire (Contributor) commented Jul 30, 2025

Had a discussion with @albanD offline. Adding an XPU entry is OK here, but we also want to make it clear that the XPU backend is experimental, so we don't make any BC promises for it. Thus I plan to create an "_experimental" directory under torch/csrc/inductor/aoti_torch/generated/ and move c_shim_xpu.h under there.



```cpp
  TORCH_CHECK(
      self.dtype() == at::kChar,
      ": Expected self dtype to be of type int8 but got ",
```
Collaborator

Suggested change:
```diff
-      ": Expected self dtype to be of type int8 but got ",
+      "Expected self dtype to be of type int8 but got ",
```

```cpp
      self.dtype());
  TORCH_CHECK(
      mat2.dtype() == at::kChar,
      ": Expected mat2 dtype to be of type int8 but got ",
```
Collaborator

Suggested change:
```diff
-      ": Expected mat2 dtype to be of type int8 but got ",
+      "Expected mat2 dtype to be of type int8 but got ",
```


```cpp
  TORCH_CHECK(result.is_contiguous(), "Expected result to be contiguous.");

  if (result.numel() == 0 || self.size(1) == 0) {
```
Collaborator

Why does this check not check self.numel()? What's the meaning of self.size(1) == 0?

Collaborator

@xiaowangintel , could you please elaborate on self.size(1) == 0 and why not self.numel() == 0?

Comment on lines 529 to 530
```cpp
  Tensor mat2_scales = at::ones({1}, mat2.options().dtype(at::kFloat));
  Tensor mat2_zero_points = at::zeros({1}, mat2.options().dtype(at::kInt));
```
Collaborator

@ZhiweiYan-96 , does oneDNN support both the scale and the zero_point being scalars? I'm kind of concerned about the integration overhead.

Contributor

The scale and zp need to be wrapped as XPU tensors to be used as the attribute inputs; scalars are not supported yet.

@ZhiweiYan-96 (Collaborator) commented Jul 22, 2025

oneDNN cannot accept a host scalar directly at the moment, so the H2D copy here always exists. We can remove the H2D copy once the fix is merged in oneDNN.


@EikanWang (Collaborator)

@xiaowangintel , please help resolve the conflicts.

@guangyey guangyey added the ciflow/xpu Run XPU CI tasks label Jul 23, 2025


@EikanWang (Collaborator)

@etaf, is the failed case a new failure?

@etaf (Collaborator) commented Jul 24, 2025

> @etaf, is the failed case a new failure?

Yes, I'll disable it first.

@guangyey guangyey moved this from Pre-Review Required to Review Required in PyTorch Intel Jul 25, 2025
@guangyey guangyey requested review from albanD, desertfire and malfet July 25, 2025 03:00
@EikanWang (Collaborator)

@albanD , @desertfire , may I know if you have any comments on this PR?

@xiaowangintel (Contributor, Author)

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Aug 1, 2025
@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here.

@pytorchmergebot (Collaborator)

The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
For more information see the pytorch-bot wiki.

@guangyey (Collaborator) commented Aug 1, 2025

@pytorchbot merge

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here.

@pytorchmergebot (Collaborator)

The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
For more information see the pytorch-bot wiki.

@guangyey (Collaborator) commented Aug 2, 2025

@pytorchbot merge

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here.

@github-project-automation github-project-automation bot moved this from Review Required to Done in PyTorch Intel Aug 2, 2025
