Minor fixes to make torchbench runnable on torch/xla #107919
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/107919
Note: Links to docs will display an error until the docs builds have been completed. ✅ No failures as of commit c7a1b50 with merge base c99a70c. This comment was automatically generated by Dr. CI and updates every 15 minutes.
benchmarks/dynamo/common.py (Outdated)

    try:
        import torch_xla.core.xla_model as xm

        device = xm.xla_device()
What's the purpose of calling xla_device()?
I assume that if it failed it would throw a different exception, so catching ImportError isn't enough; it's also odd to bind it to a name. Is it enough to just do the import here and make the device call somewhere later, at first use?
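A minimal sketch of the deferred pattern being suggested here, assuming a hypothetical `get_xla_device` helper (not code from this PR):

```python
# Sketch only: guard the import with ImportError and defer the device call
# to the first actual use, so runtime-init failures surface at that point
# instead of being swallowed by the import-time try/except.
try:
    import torch_xla.core.xla_model as xm
except ImportError:
    xm = None

def get_xla_device():
    # Hypothetical helper: called lazily at first use.
    if xm is None:
        raise RuntimeError("torch_xla is not installed")
    return xm.xla_device()
```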
The actual bug is discussed in pytorch/xla#4174. I think it requires pytorch/xla to register something with PyTorch so backward can run correctly (when a CPU backward was run first).
The reason the old workaround worked is that we used to eagerly init the runtime, but now that doesn't happen until an actual device is initialized. I need to root cause this issue...
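For context, a rough repro sketch of the scenario described above (my own assumptions, not code from the PR or the linked issue):

```python
# Sketch: a CPU backward runs before anything on the XLA side has been
# initialized, and only afterwards is a backward on an XLA tensor attempted.
import torch
import torch_xla.core.xla_model as xm

# CPU backward first; the XLA runtime has not been initialized yet.
x = torch.randn(4, requires_grad=True)
x.sum().backward()

# XLA backward afterwards. Without eager runtime init (or the explicit
# device-init workaround in this PR), this is where the reported failure
# would show up.
device = xm.xla_device()
y = torch.randn(4, requires_grad=True, device=device)
y.sum().backward()
```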
Initializing the device here is not ideal, as it actually makes it impossible to run more than one test at the same time on TPU, since the main process initializes the device outside of the spawn function and keeps the device.
I tried a few other approaches and they didn't work. I decided it is better to at least make training runnable and fix this later in nightly.
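A rough sketch of the multiprocessing concern, assuming the usual `xmp.spawn` pattern (not code from this PR):

```python
# Sketch: acquiring the XLA device in the parent process pins the TPU
# runtime there, so worker processes created by xmp.spawn can no longer
# initialize it themselves.
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp

def _worker(index):
    # Safe: each spawned worker initializes its own XLA device.
    device = xm.xla_device()
    print(index, device)

if __name__ == "__main__":
    # Problematic: initializing the device here, before xmp.spawn, means the
    # main process holds the runtime and the workers cannot.
    # device = xm.xla_device()
    xmp.spawn(_worker, args=())
```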
yea, let's make sure the script still works when XLA is missing.
It should, since the device init only happens when the torch_xla import succeeds.
Let me try to dig into this issue a bit more this afternoon and see if I can work around it.
I think a few things may make it better:
- Have an API to explicitly initialize the XLA runtime and call that rather than xla_device(). This can avoid some confusion.
- Or figure out the root cause :)
But I'm OK to stamp the PR now to unblock, unless @wconstab has other thoughts.
Thanks Shunting, I added an API to explicitly init the runtime. Let me update it.
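The guarded call in common.py might then look roughly like the sketch below; `init_xla_runtime` is a hypothetical placeholder, since the exact name of the new API is not shown in this thread:

```python
# Hypothetical sketch only: `init_xla_runtime` stands in for the real
# explicit-init API added on the torch_xla side.
try:
    import torch_xla.core.xla_model as xm  # noqa: F401
    from torch_xla import init_xla_runtime  # hypothetical name

    # Initialize the runtime explicitly instead of binding an unused
    # device object via `device = xm.xla_device()`.
    init_xla_runtime()
except ImportError:
    # torch_xla is not installed; the benchmark script should still work.
    pass
```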
@pytorchbot merge
Merge failed. Reason: This PR needs a `release notes:` label. If not, please add the `topic: not user facing` label. To add a label, you can comment to pytorchbot, for example `@pytorchbot label "topic: not user facing"`.
Details for Dev Infra team: raised by workflow job.
@pytorchbot label "topic: not user facing"
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: 1 mandatory check(s) failed. Dig deeper by viewing the failures on hud.
Failure doesn't seem relevant.
Agree that test_foreach is not relevant. If the other tests all pass, you can merge with the -f flag (or I can help), but you must wait for tests to finish before merging, or -f will ignore still-pending tests.
The only test failure seems irrelevant. I will force merge.
@pytorchbot merge -f "test failure is irrelevant"
Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use `-f` only as a last resort. Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
`import torch_xla.core.xla_model as xm` no longer triggers the XLA runtime to init, hence explicitly create the device here. This is a workaround for pytorch/xla#4174. The `is_correct` reference has been deleted; I think it is dead code.

After this patch, I am able to run

```
python benchmarks/dynamo/torchbench.py --randomize-input --performance --trace-on-xla --training --backend=openxla --only resnet50
```

Pull Request resolved: #107919
Approved by: https://github.com/shunting314, https://github.com/wconstab
@pytorchbot revert -m 'Conflicts with the revert of #106914' -c ghfirst

Hi @JackCaoG, sorry for the churn, but I have to unland your PR temporarily, as it conflicts with another revert (#106914) (in the xla hash). Please rebase and reland at your convenience.
@pytorchbot revert -m 'Conflicts with the revert of 106914' -c ghfirst
@pytorchbot successfully started a revert job. Check the current status here.
@JackCaoG your PR has been successfully reverted.
This reverts commit ed8f212. Reverted #107919 on behalf of https://github.com/izaitsevfb due to Conflicts with the revert of 106914 ([comment](#107919 (comment)))
The PR was reverted because it touches the XLA pin and another PR that also touches the XLA pin got reverted. Reopening the PR and trying to merge it again.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @chenyang78 @aakhundov @kadeng @anijain2305