fix hpu storage serialization #101680
Conversation
🔗 Helpful Links: see artifacts and rendered test results at hud.pytorch.org/pr/101680. Note: links to docs will display an error until the docs builds have completed.
✅ No Failures as of commit 4fc1665. This comment was automatically generated by Dr. CI and updates every 15 minutes.
Force-pushed from f3f4f9b to 6cf7306 (Compare)
Force-pushed from 2831579 to 990b4a6 (Compare)
This looks good to me, just two questions
""" | ||
non_blocking = _get_async_or_non_blocking("hpu", non_blocking, kwargs) | ||
hpu = getattr(torch, "hpu", None) | ||
assert hpu is not None, "HPU device module is not loaded" |
Is it guaranteed that hpu will be available if hasattr(torch, "hpu") is true? Do we have an equivalent of torch.{device}.is_available() for hpu?
Actually, the availability of torch.hpu only means that the module has been loaded; a subsequent check for an actual device can be done with torch.hpu.is_available, just as you suspected. It's just that at this level (_utils.py) there are no such checks.
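The two-level check described above can be sketched as follows. This is a hypothetical helper, not PyTorch API: `backend_ready` and the stand-in objects are illustrations of the pattern (attribute presence means "module loaded", `is_available()` means "devices present"), demonstrated here with `SimpleNamespace` so the sketch runs without an HPU build of torch.

```python
from types import SimpleNamespace

def backend_ready(torch_like, name):
    """Two-level backend check: module loaded, then devices present."""
    mod = getattr(torch_like, name, None)  # level 1: is the backend module loaded?
    if mod is None:
        return False
    return mod.is_available()              # level 2: are actual devices present?

# Stand-in for torch with an hpu module loaded but no devices attached.
fake_torch = SimpleNamespace(hpu=SimpleNamespace(is_available=lambda: False))
print(backend_ready(fake_torch, "hpu"))  # module loaded, no device -> False
print(backend_ready(fake_torch, "xpu"))  # module not even loaded -> False
```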
Hm, why is it not available in _utils.py? I see that torch.xpu.is_available() is used in this file.
This is aligned with _cuda. The user requested hpu storage on a particular device. That's a non-negotiable request, so we'd error anyway if we can't fulfill it. An explicit check of hpu.is_available doesn't buy us much: with it, we can report that there are no hpu devices at all; without it, we still fail, only this time because the particular requested device is not there.
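The trade-off argued above can be sketched like this. The function and its messages are hypothetical, not the actual _utils.py code: with or without an explicit availability check, an impossible restore request fails either way; the check only changes which error the user sees.

```python
def restore_to_hpu(device_index, device_count):
    """Hypothetical sketch: both paths fail for an impossible request;
    an explicit availability check only changes the error message."""
    if device_count == 0:
        # path an explicit is_available()-style check would take
        raise RuntimeError("no HPU devices are available")
    if device_index >= device_count:
        # path without it: the specific requested device is missing
        raise RuntimeError(f"hpu:{device_index} requested but only "
                           f"{device_count} device(s) present")
    return device_index
```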
Got it, thanks for clarifying
torch/serialization.py
Outdated
```python
def validate_hpu_device(location):
    hpu = getattr(torch, "hpu", None)
    assert hpu is not None, "HPU device module is not loaded"
    device = hpu._utils._get_device_index(location)
```
For my understanding, is there a reason why we don't pass optional=True here? (Just comparing with validate_cuda_device and curious to know if this is intentional.)
That's a good point. The hpu module's API did not implement the 'optional' argument, but there is no good reason for that and I'll fix it.
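A minimal sketch of the 'optional' semantics being discussed, assuming it mirrors the CUDA-side convention: with optional=True, a location string with no explicit index falls back to a default device instead of raising. `get_device_index` is a simplified stand-in, not the real `_get_device_index` helper.

```python
def get_device_index(location, optional=False):
    """Simplified stand-in for a _get_device_index-style helper.

    With optional=True, a bare device type like "hpu" maps to a default
    device index instead of raising.
    """
    if ":" in location:
        return int(location.split(":", 1)[1])
    if optional:
        return 0  # fall back to a default device index
    raise ValueError(f"expected a device index in {location!r}")

print(get_device_index("hpu:3"))               # -> 3
print(get_device_index("hpu", optional=True))  # -> 0
```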
check if hpu is loaded before accessing torch.hpu
make validate_hpu_device public

Change-Id: I8e356cc03f83e96810a6e504031c7f6f305680be
Force-pushed from ca82ec1 to 4fc1665 (Compare)
Thanks, this lgtm! Just one question
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: 1 job has failed; first few of them are: trunk / win-vs2019-cuda11.8-py3 / build. Details for Dev Infra team: raised by workflow job.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Change-Id: Ia534400a0e8972590374eceba5b62a2525b796e5
Fixes #ISSUE_NUMBER
cc @mruberry @mikaylagawarecki