
Conversation

Contributor

@retonym retonym commented Jan 14, 2025

Reopen the previous stale closed PR #134192

We need to increase the tolerance slightly to ensure that certain models pass the accuracy check on the XPU device.
This pull request preserves the original tolerance thresholds for the CUDA device and introduces a new key, higher_fp16_bf16_xpu, which affects only the XPU device.
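The lookup described above can be sketched as follows. This is a hypothetical illustration, not the actual benchmark code: the table contents, function name, and threshold values are assumed for the example; only the key names higher_fp16 and higher_fp16_bf16_xpu come from the PR.

```python
# Hypothetical sketch of a per-device tolerance table: the XPU-only key widens
# the tolerance for listed models on XPU while leaving CUDA thresholds untouched.
_tolerance = {
    "higher_fp16": ["model_a"],            # looser fp16 tolerance on all devices
    "higher_fp16_bf16_xpu": ["model_b"],   # looser fp16/bf16 tolerance, XPU only
}

def fp16_tolerance(name, current_device, default=1e-3, higher=4e-3):
    """Return the fp16 accuracy tolerance for a model on a given device."""
    if name in _tolerance["higher_fp16"]:
        return higher
    if current_device == "xpu" and name in _tolerance["higher_fp16_bf16_xpu"]:
        return higher
    return default

print(fp16_tolerance("model_b", "xpu"))   # 0.004 (widened on XPU)
print(fp16_tolerance("model_b", "cuda"))  # 0.001 (CUDA unaffected)
```

Models not listed under either key keep the default tolerance on every device.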

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @chenyang78 @kadeng @chauhang @amjames


pytorch-bot bot commented Jan 14, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/144756

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 76195b4 with merge base b4cee2b:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@retonym
Contributor Author

retonym commented Jan 14, 2025

@EikanWang, please help review this reopened PR. The previous PR #134192 with the same change was closed due to staleness.

chuanqi129
chuanqi129 previously approved these changes Jan 15, 2025
EikanWang
EikanWang previously approved these changes Jan 15, 2025
Comment on lines 411 to 424
Collaborator


Suggested change

Before:
    def xpu_higher_tolerance(self, current_device, name):
        return (
            current_device == "xpu" and name in self._tolerance["higher_fp16_bf16_xpu"]
        )
After:
    def xpu_higher_tolerance(self, current_device):
        return self._tolerance["higher_fp16_bf16_xpu"] if current_device == "xpu" else []

Contributor Author

Thanks for the suggestions; the code has been modified accordingly.
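The refactor the reviewer suggested can be sketched as a minimal runnable example. The table contents and the needs_higher_fp16 wrapper are assumed for illustration; the point is that the helper returns the XPU-only model list (empty off-XPU) rather than a boolean, so callers can simply concatenate it with the device-agnostic list.

```python
# Hypothetical table; only the key names come from the PR.
_tolerance = {
    "higher_fp16": ["model_a"],
    "higher_fp16_bf16_xpu": ["model_b"],
}

def xpu_higher_tolerance(current_device):
    # Return the XPU-only model list, or an empty list on other devices.
    return _tolerance["higher_fp16_bf16_xpu"] if current_device == "xpu" else []

def needs_higher_fp16(name, current_device):
    # List concatenation keeps the membership test a single expression.
    return name in _tolerance["higher_fp16"] + xpu_higher_tolerance(current_device)

print(needs_higher_fp16("model_b", "xpu"))   # True
print(needs_higher_fp16("model_b", "cuda"))  # False
```

Returning a list instead of a boolean also drops the `name` parameter from the helper, since the membership check moves to the caller.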

Comment on lines 421 to 439
Collaborator

Suggested change

Before:
    if name in self._tolerance["higher_fp16"] or self.xpu_higher_tolerance(
        current_device, name
    ):
After:
    if name in self._tolerance["higher_fp16"] + self.xpu_higher_tolerance(current_device):

Comment on lines 430 to 448
Collaborator

Suggested change

Before:
    if name in self._tolerance["higher_bf16"] or self.xpu_higher_tolerance(
        current_device, name
    ):
After:
    if name in self._tolerance["higher_bf16"] + self.xpu_higher_tolerance(current_device):

desertfire
desertfire previously approved these changes Jan 15, 2025
@retonym retonym force-pushed the yunfei/xpu_tolerance branch from 9e954c9 to 8373c17 on January 16, 2025 at 05:47
@retonym
Contributor Author

retonym commented Jan 21, 2025

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased yunfei/xpu_tolerance onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout yunfei/xpu_tolerance && git pull --rebase)

@retonym
Contributor Author

retonym commented Feb 5, 2025

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Tried to rebase and push PR #144756, but it was already up to date. Try rebasing against main by issuing:
@pytorchbot rebase -b main

@retonym
Contributor Author

retonym commented Feb 7, 2025

@desertfire @EikanWang Could you please help review and merge this PR? It introduces a tolerance setting for the XPU device in the Dynamo benchmark test.

@jianyizh
Contributor

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased yunfei/xpu_tolerance onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout yunfei/xpu_tolerance && git pull --rebase)

@EikanWang
Collaborator

@pytorchbot rebase -b main

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/main. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased yunfei/xpu_tolerance onto refs/remotes/origin/main, please pull locally before adding more changes (for example, via git checkout yunfei/xpu_tolerance && git pull --rebase)

@EikanWang
Collaborator

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk label (Trigger trunk jobs on your pull request) Apr 16, 2025
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

amathewc pushed a commit to amathewc/pytorch that referenced this pull request Apr 17, 2025
… Device (pytorch#144756)

Reopen the previous stale closed PR pytorch#134192

We need to increase the tolerance slightly to ensure that certain models pass the accuracy check on the XPU device.
This pull request preserves the original tolerance thresholds for the CUDA device and introduces a new key, higher_fp16_bf16_xpu, which affects only the XPU device.

Pull Request resolved: pytorch#144756
Approved by: https://github.com/chuanqi129, https://github.com/EikanWang, https://github.com/desertfire
@malfet
Contributor

malfet commented Apr 17, 2025

@pytorchbot revert -m "Broke rocm torch bench runs with TypeError: unsupported operand type(s) for |: 'set' and 'list'" -c nosignal
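The revert reason can be reproduced in a few lines. This is a minimal standalone illustration (the variable names and model names are assumed, not taken from the benchmark code): in Python, the `|` operator on a set only accepts another set, so combining a set of model names with a list raises exactly the TypeError quoted above.

```python
# Minimal reproduction of the error that triggered the revert.
higher_fp16 = {"model_a"}   # a set of model names
xpu_extra = ["model_b"]     # a list of model names

try:
    combined = higher_fp16 | xpu_extra  # set | list is not allowed
except TypeError as e:
    print(e)  # unsupported operand type(s) for |: 'set' and 'list'

# Two safe fixes: normalize both sides to sets, or keep everything as lists.
combined = higher_fp16 | set(xpu_extra)           # set union
combined_list = sorted(higher_fp16) + xpu_extra   # list concatenation
```

Note that `higher_fp16.union(xpu_extra)` would also work, since `set.union()` accepts any iterable, unlike the `|` operator.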

@pytorchmergebot
Collaborator

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

pytorchmergebot added a commit that referenced this pull request Apr 17, 2025
…y on XPU Device (#144756)"

This reverts commit 300e0ee.

Reverted #144756 on behalf of https://github.com/malfet due to: Broke rocm torch bench runs with TypeError: unsupported operand type(s) for |: 'set' and 'list'
@pytorchmergebot
Collaborator

@retonym your PR has been successfully reverted.

@pytorchmergebot pytorchmergebot added the Reverted and ci-no-td (Do not run TD on this PR) labels Apr 17, 2025
@pytorch-bot pytorch-bot bot dismissed stale reviews from chuanqi129, EikanWang, and desertfire on April 17, 2025 at 11:09

This PR was reopened (likely due to being reverted), so your approval was removed. Please request another review.

@albanD albanD added the triaged label (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) Apr 17, 2025
Contributor

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

@github-actions github-actions bot added the Stale label Jun 16, 2025
@github-actions github-actions bot closed this Jul 16, 2025

Labels

ci-no-td (Do not run TD on this PR), ciflow/trunk (Trigger trunk jobs on your pull request), Merged, module: dynamo, open source, Reverted, Stale, topic: not user facing (topic category), triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

9 participants