
Conversation

Contributor

@retonym retonym commented Jan 14, 2025

Reopen the previous stale closed PR #134192

We need to increase the tolerance slightly to ensure that certain models pass the accuracy check on the XPU device.
This pull request preserves the original tolerance thresholds for the CUDA device and introduces a new key, higher_fp16_bf16_xpu, which affects only the XPU device.
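The lookup described above can be sketched as follows. This is a hypothetical illustration, not the actual benchmark code: the table contents, function name, and threshold values are assumed for the example; only the key names higher_fp16 and higher_fp16_bf16_xpu come from the PR.

```python
# Hypothetical sketch of a per-device tolerance table: the XPU-only key widens
# the tolerance for listed models on XPU while leaving CUDA thresholds untouched.
_tolerance = {
    "higher_fp16": ["model_a"],            # looser fp16 tolerance on all devices
    "higher_fp16_bf16_xpu": ["model_b"],   # looser fp16/bf16 tolerance, XPU only
}

def fp16_tolerance(name, current_device, default=1e-3, higher=4e-3):
    """Return the fp16 accuracy tolerance for a model on a given device."""
    if name in _tolerance["higher_fp16"]:
        return higher
    if current_device == "xpu" and name in _tolerance["higher_fp16_bf16_xpu"]:
        return higher
    return default

print(fp16_tolerance("model_b", "xpu"))   # 0.004 (widened on XPU)
print(fp16_tolerance("model_b", "cuda"))  # 0.001 (CUDA unaffected)
```

Models not listed under either key keep the default tolerance on every device.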

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @chenyang78 @kadeng @chauhang @amjames


pytorch-bot bot commented Jan 14, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/144756

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 76195b4 with merge base b4cee2b:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@retonym
Contributor Author

retonym commented Jan 14, 2025

@EikanWang, please help review this reopened PR. The previous PR #134192 with the same change was closed due to staleness.

chuanqi129
chuanqi129 previously approved these changes Jan 15, 2025
EikanWang
EikanWang previously approved these changes Jan 15, 2025
Comment on lines 411 to 424
Collaborator


Suggested change

Before:
    def xpu_higher_tolerance(self, current_device, name):
        return (
            current_device == "xpu" and name in self._tolerance["higher_fp16_bf16_xpu"]
        )
After:
    def xpu_higher_tolerance(self, current_device):
        return self._tolerance["higher_fp16_bf16_xpu"] if current_device == "xpu" else []

Contributor Author

Thanks for the suggestions; the code has been modified accordingly.
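The refactor the reviewer suggested can be sketched as a minimal runnable example. The table contents and the needs_higher_fp16 wrapper are assumed for illustration; the point is that the helper returns the XPU-only model list (empty off-XPU) rather than a boolean, so callers can simply concatenate it with the device-agnostic list.

```python
# Hypothetical table; only the key names come from the PR.
_tolerance = {
    "higher_fp16": ["model_a"],
    "higher_fp16_bf16_xpu": ["model_b"],
}

def xpu_higher_tolerance(current_device):
    # Return the XPU-only model list, or an empty list on other devices.
    return _tolerance["higher_fp16_bf16_xpu"] if current_device == "xpu" else []

def needs_higher_fp16(name, current_device):
    # List concatenation keeps the membership test a single expression.
    return name in _tolerance["higher_fp16"] + xpu_higher_tolerance(current_device)

print(needs_higher_fp16("model_b", "xpu"))   # True
print(needs_higher_fp16("model_b", "cuda"))  # False
```

Returning a list instead of a boolean also drops the `name` parameter from the helper, since the membership check moves to the caller.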

Comment on lines 421 to 439
Collaborator

Suggested change

Before:
    if name in self._tolerance["higher_fp16"] or self.xpu_higher_tolerance(
        current_device, name
    ):
After:
    if name in self._tolerance["higher_fp16"] + self.xpu_higher_tolerance(current_device):

Comment on lines 430 to 448
Collaborator

Suggested change

Before:
    if name in self._tolerance["higher_bf16"] or self.xpu_higher_tolerance(
        current_device, name
    ):
After:
    if name in self._tolerance["higher_bf16"] + self.xpu_higher_tolerance(current_device):

desertfire
desertfire previously approved these changes Jan 15, 2025
@retonym retonym force-pushed the yunfei/xpu_tolerance branch from 9e954c9 to 8373c17 on January 16, 2025 at 05:47
@retonym
Contributor Author

retonym commented Jan 21, 2025

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased yunfei/xpu_tolerance onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout yunfei/xpu_tolerance && git pull --rebase)

@retonym
Contributor Author

retonym commented Feb 5, 2025

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Tried to rebase and push PR #144756, but it was already up to date. Try rebasing against main by issuing:
@pytorchbot rebase -b main

@retonym
Contributor Author

retonym commented Feb 7, 2025

@desertfire @EikanWang Could you please help review and merge this PR? It introduces a tolerance setting for the XPU device in the Dynamo benchmark test.

@jianyizh
Contributor

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased yunfei/xpu_tolerance onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout yunfei/xpu_tolerance && git pull --rebase)

@EikanWang
Collaborator

@pytorchbot rebase -b main

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/main. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased yunfei/xpu_tolerance onto refs/remotes/origin/main, please pull locally before adding more changes (for example, via git checkout yunfei/xpu_tolerance && git pull --rebase)

@EikanWang
Collaborator

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk label (Trigger trunk jobs on your pull request) Apr 16, 2025
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

amathewc pushed a commit to amathewc/pytorch that referenced this pull request Apr 17, 2025
… Device (pytorch#144756)

Reopen the previous stale closed PR pytorch#134192

We need to increase the tolerance slightly to ensure that certain models pass the accuracy check on the XPU device.
This pull request preserves the original tolerance thresholds for the CUDA device and introduces a new key, higher_fp16_bf16_xpu, which affects only the XPU device.

Pull Request resolved: pytorch#144756
Approved by: https://github.com/chuanqi129, https://github.com/EikanWang, https://github.com/desertfire
@malfet
Contributor

malfet commented Apr 17, 2025

@pytorchbot revert -m "Broke rocm torch bench runs with TypeError: unsupported operand type(s) for |: 'set' and 'list'" -c nosignal
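The revert reason can be reproduced in a few lines. This is a minimal standalone illustration (the variable names and model names are assumed, not taken from the benchmark code): in Python, the `|` operator on a set only accepts another set, so combining a set of model names with a list raises exactly the TypeError quoted above.

```python
# Minimal reproduction of the error that triggered the revert.
higher_fp16 = {"model_a"}   # a set of model names
xpu_extra = ["model_b"]     # a list of model names

try:
    combined = higher_fp16 | xpu_extra  # set | list is not allowed
except TypeError as e:
    print(e)  # unsupported operand type(s) for |: 'set' and 'list'

# Two safe fixes: normalize both sides to sets, or keep everything as lists.
combined = higher_fp16 | set(xpu_extra)           # set union
combined_list = sorted(higher_fp16) + xpu_extra   # list concatenation
```

Note that `higher_fp16.union(xpu_extra)` would also work, since `set.union()` accepts any iterable, unlike the `|` operator.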

@pytorchmergebot
Collaborator

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

pytorchmergebot added a commit that referenced this pull request Apr 17, 2025
…y on XPU Device (#144756)"

This reverts commit 300e0ee.

Reverted #144756 on behalf of https://github.com/malfet due to: Broke rocm torch bench runs with TypeError: unsupported operand type(s) for |: 'set' and 'list'
@pytorchmergebot
Collaborator

@retonym your PR has been successfully reverted.

@pytorchmergebot pytorchmergebot added the Reverted and ci-no-td (Do not run TD on this PR) labels Apr 17, 2025
@pytorch-bot pytorch-bot bot dismissed stale reviews from chuanqi129, EikanWang, and desertfire on April 17, 2025 at 11:09

This PR was reopened (likely due to being reverted), so your approval was removed. Please request another review.

@albanD albanD added the triaged label (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) Apr 17, 2025
Contributor

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

@github-actions github-actions bot added the Stale label Jun 16, 2025
@github-actions github-actions bot closed this Jul 16, 2025

Labels

ci-no-td (Do not run TD on this PR), ciflow/trunk (Trigger trunk jobs on your pull request), Merged, module: dynamo, open source, Reverted, Stale, topic: not user facing (topic category), triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

9 participants