Enable _int_mm on Intel GPU #157769
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/157769
Note: Links to docs will display an error until the docs builds have been completed.
✅ You can merge normally! (2 unrelated failures) As of commit e511d78 with merge base 255a04b:
FLAKY - The following job failed but was likely due to flakiness present on trunk.
UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@liangan1 please help review this PR.
Attention! native_functions.yaml was changed. If you are adding a new function or defaulted argument to native_functions.yaml, you cannot use it from pre-existing Python frontend code until our FC window passes (two weeks). Split your PR into two PRs: one that adds the new C++ functionality, and one that makes use of it from Python, and land them two weeks apart. See https://github.com/pytorch/pytorch/wiki/PyTorch's-Python-Frontend-Backward-and-Forward-Compatibility-Policy#forwards-compatibility-fc for more info.
Attention! One of the PyTorch C-stable API files was changed. You MUST NOT change existing function declarations in it, as this header defines a stable C ABI. If you need to change the signature of a function, introduce a new v2 version of the function and modify code generation to target the new version.
I assume you should enable
pytorch/test/inductor/test_aot_inductor.py
Lines 6023 to 6038 in 3ee8828
@skipIfXpu(
    msg="The operator 'aten::_int_mm' is not currently implemented for the XPU device"
)
def test__int_mm(self):
    class Model(torch.nn.Module):
        def __init__(self) -> None:
            super().__init__()

        def forward(self, x, y):
            return torch._int_mm(x, y)

    example_inputs = (
        torch.randint(-10, 10, (64, 32), device=self.device, dtype=torch.int8),
        torch.randint(-10, 10, (32, 64), device=self.device, dtype=torch.int8),
    )
    self.check_model(Model(), example_inputs)
and
pytorch/test/inductor/test_select_algorithm.py
Lines 148 to 159 in 3ee8828
@patches
@skipIfXpu(msg="XPU has not supported _int_mm yet")
def test__int_mm(self):
    @torch.compile
    def foo(a, b):
        return torch._int_mm(a, b)

    foo(
        torch.randint(-10, 10, (64, 32), device=GPU_TYPE, dtype=torch.int8),
        torch.randint(-10, 10, (32, 64), device=GPU_TYPE, dtype=torch.int8),
    )
    self.assertEqual(counters["inductor"]["select_algorithm_autotune"], 1)
Tensor bias = at::Tensor();
Tensor mat2_scales = at::ones({1}, mat2.options().dtype(at::kDouble));
Tensor mat2_zero_points = at::zeros({1}, mat2.options().dtype(at::kInt));
@ZhiweiYan-96 I found mat2_zero_points is an unused parameter in quantized_matmul, right? Could you help check it?
Yes, weight_zp != 0 has no usage currently. This is because the pt2e operators are specifically designed for the nn.Linear module, and the weight should have no zero point.
By the way, @xiaowangintel you may set m2_zp = std::nullopt to save one copy kernel and one kernel launch here; this should be helpful for low-precision inference.
Do you mean Tensor mat2_zero_points = at::Tensor(); is enough?
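For context, a minimal Python sketch of why the weight zero point is unused under the symmetric int8 scheme described above (my own illustration, not code from this PR): with symmetric per-tensor quantization the zero point is fixed at 0, so dequantization is just q * scale.

import torch

# Symmetric per-tensor int8 quantization of a weight: zero_point is 0 by
# construction, so dequant(q) = q * scale and any weight_zp input is a no-op.
w = torch.randn(32, 32)
scale = w.abs().max() / 127
q = torch.clamp((w / scale).round(), -128, 127).to(torch.int8)
w_hat = q.to(torch.float32) * scale  # no zero_point term needed
print((w - w_hat).abs().max())       # small quantization error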
Tensor bias = at::Tensor();
Tensor mat2_scales = at::ones({1}, mat2.options().dtype(at::kDouble));
I noticed that mat2_scales will be converted to kFloat inside quantized_matmul, so why do we need to change its dtype here?
fixed
@parametrize("use_transpose_a", [True, False]) | ||
@parametrize("use_transpose_b", [True, False]) | ||
@parametrize("non_contig_type", [0, 1, 2]) | ||
def test__int_mm( |
Overall LGTM.
The only nit: instead of adding a new unit test, could we share the existing one with another backend? Such as
Line 7448 in 1b58e7a
def test__int_mm_cpu(self, device, m, k, n, use_transpose_a, use_transpose_b, non_contig_type):
and
pytorch/test/test_out_dtype_op.py
Lines 187 to 203 in 1b58e7a
def test_out_dtype_inductor_decomp_trace(self) -> None:
    def func(x, w):
        return out_dtype(torch.ops.aten.mm.default, torch.int32, x, w)

    w = torch.randint(-128, 127, (32, 32), dtype=torch.int8, device="cuda")
    x = torch.randint(-128, 127, (32, 32), dtype=torch.int8, device="cuda")

    # Check that make_fx with inductor decomps produces _int_mm
    decomp_table = torch._inductor.decomposition.select_decomp_table()
    gm = make_fx(func, decomp_table, tracing_mode="symbolic")(x, w)
    self.assertExpectedInline(gm.code.strip(), """\
def forward(self, x_1, w_1):
    _int_mm = torch.ops.aten._int_mm.default(x_1, w_1);  x_1 = w_1 = None
    return _int_mm""")

@unittest.skipIf(not TEST_CUDA, "cuda only")
def test_out_dtype_int_mm_default_trace(self) -> None:
I mean you could enable only these three UTs for Intel GPU, not all the UTs in the file.
@guangyey Per my understanding, to enable these 3 UTs we would need to enable the XPU device for all UTs and add a lot of skips for the other UTs, which may involve a lot of code changes. I prefer to keep the same logic as the other UTs, which means we can first enable the XPU device for _int_mm in the ported test_linear.py in torch-xpu-op by @deng, and she will upstream these UTs to the public test_linalg.py in the near future.
@liangan1, @xiaowangintel, do you know why adding XPU results in many failed cases? I'm trying to understand the feature gap.
By the way, we can extract the 3 UTs into another test class and then enable XPU for this newly added test class.
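A rough sketch of that "separate test class" idea (my own illustration; the class name, the reference check, and the allow_xpu flag are assumptions, not part of this PR), using PyTorch's device-generic test machinery:

import torch
from torch.testing._internal.common_utils import TestCase, run_tests
from torch.testing._internal.common_device_type import instantiate_device_type_tests


class TestIntMM(TestCase):
    def test__int_mm_basic(self, device):
        a = torch.randint(-10, 10, (64, 32), device=device, dtype=torch.int8)
        b = torch.randint(-10, 10, (32, 64), device=device, dtype=torch.int8)
        out = torch._int_mm(a, b)
        # Reference computed on CPU in int32 to stay independent of the backend.
        ref = torch.mm(a.cpu().to(torch.int32), b.cpu().to(torch.int32))
        self.assertEqual(out.cpu(), ref)


# Assumes the allow_xpu flag available in recent common_device_type helpers.
instantiate_device_type_tests(TestIntMM, globals(), allow_xpu=True)

if __name__ == "__main__":
    run_tests()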
To add the ciflow label: this helps ensure we don't trigger CI on this PR until it is actually authorized to do so. Please ping one of the reviewers if you do not have access to approve and run workflows.
AOTI_TORCH_EXPORT AOTITorchError aoti_torch_xpu__addmm_activation(AtenTensorHandle self, AtenTensorHandle mat1, AtenTensorHandle mat2, double beta, double alpha, int32_t use_gelu, AtenTensorHandle* ret0);
AOTI_TORCH_EXPORT AOTITorchError aoti_torch_xpu__fused_moving_avg_obs_fq_helper_functional(AtenTensorHandle self, AtenTensorHandle observer_on, AtenTensorHandle fake_quant_on, AtenTensorHandle running_min, AtenTensorHandle running_max, AtenTensorHandle scale, AtenTensorHandle zero_point, double averaging_const, int64_t quant_min, int64_t quant_max, int64_t ch_axis, int32_t per_row_fake_quant, int32_t symmetric_quant, AtenTensorHandle* ret0, AtenTensorHandle* ret1, AtenTensorHandle* ret2, AtenTensorHandle* ret3, AtenTensorHandle* ret4, AtenTensorHandle* ret5);
AOTI_TORCH_EXPORT AOTITorchError aoti_torch_xpu__int_mm_out(AtenTensorHandle out, AtenTensorHandle self, AtenTensorHandle mat2);
@etaf please help review the change here
Is this change auto-generated by torchgen? Expected to be yes.
Yes
cc @desertfire to confirm whether we are OK with this one or not.
Had a discussion with @albanD offline. Adding an XPU entry is OK here, but we also want to make it clear that the XPU backend is experimental, so we don't make any BC promises for it. Thus I plan to create an "_experimental" directory under torch/csrc/inductor/aoti_torch/generated/ and move c_shim_xpu.h under there.
TORCH_CHECK(
    self.dtype() == at::kChar,
    ": Expected self dtype to be of type int8 but got ",
": Expected self dtype to be of type int8 but got ", | |
"Expected self dtype to be of type int8 but got ", |
    self.dtype());
TORCH_CHECK(
    mat2.dtype() == at::kChar,
    ": Expected mat2 dtype to be of type int8 but got ",
": Expected mat2 dtype to be of type int8 but got ", | |
"Expected mat2 dtype to be of type int8 but got ", |
TORCH_CHECK(result.is_contiguous(), "Expected result to be contiguous.");

if (result.numel() == 0 || self.size(1) == 0) {
Why does this check not check self.numel()? What's the meaning of self.size(1) == 0?
@xiaowangintel, could you please elaborate on self.size(1) == 0 and why not self.numel() == 0?
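For what it's worth, my reading of the check (an assumption, not confirmed in this thread): result.numel() == 0 covers m == 0 or n == 0, while self.size(1) == 0 covers k == 0, where the output is non-empty but there is nothing to reduce over, so the GEMM must be skipped and the result zero-filled. A small CPU illustration:

import torch

m, k, n = 64, 0, 64
a = torch.randint(-10, 10, (m, k), dtype=torch.int8)
b = torch.randint(-10, 10, (k, n), dtype=torch.int8)
# k == 0: the inputs are empty but the (m, n) result is not.
ref = a.to(torch.int32) @ b.to(torch.int32)
print(a.numel(), ref.numel())   # 0 4096
print(int(ref.abs().max()))     # 0 -> the result must be all zeros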
Tensor mat2_scales = at::ones({1}, mat2.options().dtype(at::kFloat));
Tensor mat2_zero_points = at::zeros({1}, mat2.options().dtype(at::kInt));
@ZhiweiYan-96, does oneDNN support the case where both the scale and zero_point are scalars? I'm kind of concerned about the integration overhead.
The scale and zp need to be wrapped as XPU tensors to be used as the attribute inputs; scalars are not supported yet.
oneDNN currently cannot accept a host scalar directly, so the H2D copy here always exists. We can remove the H2D copy after the fix is merged in oneDNN.
@parametrize("use_transpose_a", [True, False]) | ||
@parametrize("use_transpose_b", [True, False]) | ||
@parametrize("non_contig_type", [0, 1, 2]) | ||
def test__int_mm( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@liangan1 , @xiaowangintel , do you know why adding XPU results in many failed cases? I'm trying to understand the feature gap.
By the way, we can extract the 3 UTs as another test class and then enable XPU for this newly added test class.
@xiaowangintel, please help resolve conflicts.
Force-pushed from cf25f1c to e511d78 (compare).
@etaf, is the failed case a new failure?
Yes, I'll disable it first.
@albanD, @desertfire, may I know if you have any comments on this PR?
@pytorchbot merge
Merge started: Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
@pytorchbot merge
Merge started: Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
@pytorchbot merge
Merge started: Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Motivation
This PR enables _int_mm on Intel GPU. _int_mm is used by int8 quantization in torchao.
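A minimal usage sketch of the newly enabled op (assuming a PyTorch build with XPU support and an Intel GPU visible as device "xpu"): _int_mm multiplies two int8 matrices and accumulates the result in int32, which is the primitive that torchao's int8 quantization path relies on.

import torch

a = torch.randint(-128, 127, (64, 32), dtype=torch.int8, device="xpu")
b = torch.randint(-128, 127, (32, 64), dtype=torch.int8, device="xpu")
out = torch._int_mm(a, b)       # int8 x int8 matmul with int32 accumulation
print(out.dtype, out.shape)     # torch.int32 torch.Size([64, 64])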
Model Test Result:
We ran meta-llama/Llama-3.1-8B-Instruct on Intel GPU and A100 using torchao int8 dynamic quantization. The model configuration is as follows:
Precision: torch.bfloat16
Quantization configuration: Int8DynamicActivationInt8WeightConfig
Dataset: wikitext
Result:
The perplexity values for Intel GPU and A100 are 9.582953453063965 and 9.57755184173584, respectively.
cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @jerryzh168 @voznesenskym @penguinwu @EikanWang @Guobing-Chen @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben