
fix hpu storage serialization #101680

Closed

Conversation


@ppiskorski ppiskorski commented May 17, 2023

Change-Id: Ia534400a0e8972590374eceba5b62a2525b796e5

Fixes #ISSUE_NUMBER

cc @mruberry @mikaylagawarecki


pytorch-bot bot commented May 17, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/101680

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 4fc1665:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@drisspg drisspg added module: serialization Issues related to serialization (e.g., via pickle, or otherwise) of PyTorch objects triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module labels May 17, 2023
@github-actions

This PR needs a label

If your changes are user facing and intended to be a part of release notes, please use a label starting with release notes:.

If not, please add the topic: not user facing label.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "topic: not user facing"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@ppiskorski ppiskorski force-pushed the hpu_storage_serialization branch 2 times, most recently from f3f4f9b to 6cf7306 Compare May 22, 2023 07:19
@ppiskorski ppiskorski force-pushed the hpu_storage_serialization branch 2 times, most recently from 2831579 to 990b4a6 Compare June 5, 2023 16:15
@mikaylagawarecki mikaylagawarecki self-requested a review June 9, 2023 16:56

@mikaylagawarecki mikaylagawarecki left a comment


This looks good to me, just two questions

"""
non_blocking = _get_async_or_non_blocking("hpu", non_blocking, kwargs)
hpu = getattr(torch, "hpu", None)
assert hpu is not None, "HPU device module is not loaded"
Contributor


is it guaranteed that hpu will be available if hasattr(torch, "hpu")? do we have an equivalent of torch.{device}.is_available() for hpu?

Contributor Author


Actually, the availability of torch.hpu only means that the module has been loaded; a subsequent check for an actual device can be done with torch.hpu.is_available, just as you suspect. It's only that at this level (_utils.py) there are no such checks.
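The two-level check being discussed (module loaded vs. device actually present) can be sketched without real hardware. Everything below is illustrative: it uses a SimpleNamespace stand-in for the torch module rather than the actual PyTorch API, and describe_hpu_state is a hypothetical helper, not PyTorch code:

```python
from types import SimpleNamespace

def describe_hpu_state(torch_like):
    # Mirror the two-level check from the review thread:
    # 1) is the hpu backend module loaded at all?
    hpu = getattr(torch_like, "hpu", None)
    if hpu is None:
        return "module not loaded"
    # 2) the module being loaded does not imply a device exists
    if not hpu.is_available():
        return "module loaded, no device"
    return "device available"

# Simulated environments instead of a real torch import:
no_module = SimpleNamespace()
no_device = SimpleNamespace(hpu=SimpleNamespace(is_available=lambda: False))
with_device = SimpleNamespace(hpu=SimpleNamespace(is_available=lambda: True))

print(describe_hpu_state(no_module))    # module not loaded
print(describe_hpu_state(no_device))    # module loaded, no device
print(describe_hpu_state(with_device))  # device available
```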

Contributor


hm why is it not available in _utils.py? I see that torch.xpu.is_available() is used in this file

Contributor Author


This is aligned with _cuda. The user requested HPU storage on a particular device. That's a non-negotiable request, so we'd error anyway if we can't satisfy it. An explicit check of hpu.is_available does not buy us much: we could use it to report that there are no HPU devices at all, but without it we still fail, just because the particular requested device is not there.
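The point above can be sketched in a few lines: with or without an availability pre-check, an impossible request fails; the pre-check only changes which message the user sees. The function name and messages here are hypothetical, not PyTorch code:

```python
def select_hpu_device(available_count, requested_index):
    # Hypothetical sketch: the user's device request is non-negotiable.
    # A separate is_available()-style pre-check would only specialize
    # the error for the zero-device case...
    if available_count == 0:
        raise RuntimeError("no HPU devices")
    # ...because an out-of-range index fails here regardless.
    if requested_index >= available_count:
        raise RuntimeError(f"invalid HPU device index {requested_index}")
    return requested_index
```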

Contributor


Got it, thanks for clarifying

def validate_hpu_device(location):
    hpu = getattr(torch, "hpu", None)
    assert hpu is not None, "HPU device module is not loaded"
    device = hpu._utils._get_device_index(location)
Contributor


For my understanding, is there a reason why we don't pass optional=True here? (Just comparing with validate_cuda_device and curious to know if this is intentional.)

Contributor Author


That's a good point. The hpu module's API did not implement the 'optional' argument, but there is no good reason for that and I'll fix it.
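The 'optional' semantics under discussion can be sketched roughly, modeled on the behavior of the CUDA counterpart: a bare device string like "hpu" (no index) is only accepted when optional=True, in which case the current device is used. The function below is illustrative only; names, messages, and the -1 stand-in are assumptions, not the actual torch.hpu implementation:

```python
def get_device_index(location, optional=False):
    # Illustrative sketch of optional-index resolution.
    if location == "hpu":
        if optional:
            return -1  # stand-in for "use the current device"
        raise ValueError("expected a device index, e.g. 'hpu:0'")
    if location.startswith("hpu:"):
        # "hpu:2" -> 2
        return int(location.split(":", 1)[1])
    raise ValueError(f"not an HPU location: {location!r}")
```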

check if hpu is loaded before accessing torch.hpu

make validate_hpu_device public

Change-Id: I8e356cc03f83e96810a6e504031c7f6f305680be

@mikaylagawarecki mikaylagawarecki left a comment


Thanks, this lgtm! Just one question

@mikaylagawarecki

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Jun 21, 2023
@pytorchmergebot

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status.

@pytorchmergebot

Merge failed

Reason: 1 job has failed: trunk / win-vs2019-cuda11.8-py3 / build

Details for Dev Infra team: raised by workflow job.

@mikaylagawarecki

@pytorchbot merge

@pytorchmergebot

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status.

Labels
ciflow/trunk · Merged · module: serialization · open source · release notes: python_frontend · topic: improvements · triaged

5 participants