
Conversation

@JackCaoG (Collaborator) commented Aug 25, 2023

`import torch_xla.core.xla_model as xm` no longer triggers the XLA runtime to initialize, hence we explicitly create the device here. This is a workaround for pytorch/xla#4174.

The `is_correct` reference has been deleted; I think it was dead code.

After this patch, I am able to run

```
python benchmarks/dynamo/torchbench.py --randomize-input --performance --trace-on-xla --training --backend=openxla --only resnet50
```

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @chenyang78 @aakhundov @kadeng @anijain2305

@JackCaoG JackCaoG requested a review from shunting314 August 25, 2023 01:30
pytorch-bot bot commented Aug 25, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/107919

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit c7a1b50 with merge base c99a70c:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@JackCaoG JackCaoG requested a review from wconstab August 25, 2023 18:09
```
try:
    import torch_xla.core.xla_model as xm

    # Creating the device forces the XLA runtime to initialize.
    device = xm.xla_device()
```
Contributor

What's the purpose of calling xla_device()?

I assume if it failed it would throw a different exception, so catching ImportError isn't enough. Also, it's odd to bind it to a name. Is it enough to just do the import here and do the device call somewhere later, at first use?

Collaborator Author

The actual bug is discussed in pytorch/xla#4174. I think it requires pytorch/xla to register something with PyTorch so backward can run correctly (when a CPU backward was run first).

The reason the old workaround worked is that we used to eagerly init the runtime, but now that doesn't happen until an actual device is initialized. I need to root-cause this issue...

Collaborator Author

Initializing the device is not ideal, as it actually makes it impossible to run more than one test at the same time on TPU, since the main process inits the device outside of the spawn function and keeps holding it.

I tried a few other approaches and they didn't work. I decided it is better to at least make training runnable and fix this later in nightly.
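
For illustration, a hypothetical sketch of the ordering being described (not code from this PR): the main process claims the TPU runtime before xmp.spawn starts the test workers.

```
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp

# Problematic ordering: the harness initializes the device in the
# main process, before any worker is spawned.
device = xm.xla_device()

def _mp_fn(index):
    # The workers spawned below also ask for the device, but the parent
    # process is already holding the TPU runtime.
    print(index, xm.xla_device())

if __name__ == "__main__":
    xmp.spawn(_mp_fn, args=())
```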

Contributor

Yeah, let's make sure the script still works when XLA is missing.

Collaborator Author

It should, since the device init only happens when the torch_xla import succeeds.
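
Concretely, the guard looks roughly like this (a minimal sketch; the exact `except` handling in the benchmark script is an assumption here):

```
try:
    import torch_xla.core.xla_model as xm

    # Only reached when torch_xla is installed; creating the device
    # forces the XLA runtime to initialize.
    device = xm.xla_device()
except ImportError:
    # torch_xla is missing: skip the XLA path and keep the script
    # usable for the other backends.
    device = None
```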

Collaborator Author

Let me try to dig into this issue a bit more this afternoon and see if I can work around it.

Contributor

I think a few things may make it better:

  1. Have an API to explicitly initialize the XLA runtime and call that rather than 'xla_device'. This can avoid some confusion.
  2. Or figure out the root cause :)

But I'm ok to stamp the PR now to unblock, unless @wconstab has other thoughts.

Collaborator Author

Thanks Shunting, I added an API to explicitly init the runtime. Let me update it.
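
Something along these lines is what the explicit-init approach would look like (a sketch only; the `init_xla_runtime` name is hypothetical, since the real entry point added on the torch_xla side is not shown in this thread):

```
try:
    import torch_xla.core.xla_model as xm

    # Hypothetical name for the explicit runtime-initialization API; it
    # states the intent directly instead of creating a throwaway device.
    xm.init_xla_runtime()
except ImportError:
    pass
```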

@janeyx99 janeyx99 added the triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) label Aug 25, 2023
@JackCaoG JackCaoG requested a review from a team as a code owner August 25, 2023 20:51
@JackCaoG (Collaborator Author)

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk (Trigger trunk jobs on your pull request) label Aug 25, 2023
@pytorchmergebot (Collaborator)

Merge failed

Reason: This PR needs a release notes: label
If your changes are user facing and intended to be a part of release notes, please use a label starting with release notes:.

If not, please add the topic: not user facing label.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "topic: not user facing"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

Details for Dev Infra team (raised by workflow job)

@JackCaoG (Collaborator Author)

@pytorchbot label "topic: not user facing"

@pytorch-bot pytorch-bot bot added the topic: not user facing (topic category) label Aug 25, 2023
@JackCaoG (Collaborator Author)

@pytorchbot merge

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot (Collaborator)

Merge failed

Reason: 1 mandatory check(s) failed. The first few are:

Dig deeper by viewing the failures on hud

Details for Dev Infra team (raised by workflow job)

Failing merge rule: Core Maintainers

@JackCaoG (Collaborator Author)

```
This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
- generated xml file: /var/lib/jenkins/workspace/test/test-reports/python-pytest/inductor.test_foreach/inductor.test_foreach-016b9206757a35ab.xml -
=========================== short test summary info ============================
FAILED [0.2053s] inductor/test_foreach.py::ForeachTests::test_2d_blocking__foreach_add - AssertionError: Scalars are not equal!

Expected 6 but got 5.
Absolute difference: 1
Relative difference: 0.16666666666666666
```

The failure doesn't seem relevant.

@wconstab wconstab (Contributor) left a comment

Agree that test_foreach is not relevant. If other tests all pass, you can merge with the -f flag (or I can help), but you must wait for tests to finish before merging or -f will ignore still-pending tests.

@JackCaoG (Collaborator Author)

The only test failure seems irrelevant. I will force merge.

@JackCaoG (Collaborator Author)

@pytorchbot merge -f "test failure is irrelevant"

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

voznesenskym pushed a commit that referenced this pull request Aug 27, 2023
`import torch_xla.core.xla_model as xm` no longer triggers the XLA runtime to initialize, hence we explicitly create the device here. This is a workaround for pytorch/xla#4174.

The `is_correct` reference has been deleted; I think it was dead code.

After this patch, I am able to run

```
python benchmarks/dynamo/torchbench.py --randomize-input --performance --trace-on-xla --training --backend=openxla --only resnet50
```

Pull Request resolved: #107919
Approved by: https://github.com/shunting314, https://github.com/wconstab
@izaitsevfb (Contributor)

@pytorchbot revert -m 'Conflicts with the revert of #106914' -c ghfirst

Hi @JackCaoG, sorry for the churn, but I have to unland your PR temporarily, as it conflicts with another revert (#106914) (in xla hash).

Please rebase and reland at your convenience.

@izaitsevfb (Contributor)

@pytorchbot revert -m 'Conflicts with the revert of 106914' -c ghfirst

@pytorchmergebot (Collaborator)

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

@pytorchmergebot (Collaborator)

@JackCaoG your PR has been successfully reverted.

pytorchmergebot added a commit that referenced this pull request Aug 29, 2023
This reverts commit ed8f212.

Reverted #107919 on behalf of https://github.com/izaitsevfb due to conflicts with the revert of 106914.
@pytorchmergebot (Collaborator)

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

@pytorchmergebot (Collaborator)

@JackCaoG your PR has been successfully reverted.

@JackCaoG JackCaoG reopened this Sep 5, 2023
@JackCaoG (Collaborator Author) commented Sep 5, 2023

The PR was reverted because it touches the XLA pin and another PR that also touches the XLA pin got reverted. Reopening the PR to try to merge it again.

@JackCaoG (Collaborator Author) commented Sep 6, 2023

@pytorchbot merge

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@github-actions github-actions bot deleted the JackCaoG/fix_xla_torchbench branch March 7, 2025 02:07

Labels

ciflow/inductor, ciflow/trunk, Merged, module: dynamo, open source, Reverted, topic: not user facing, triaged

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants