[Don't Merge] Fix poison child process issue when calling getAccelerator() #144368
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/144368
Note: Links to docs will display an error until the docs builds have been completed.
❌ 4 New Failures, 2 Unrelated Failures as of commit 4636c6f with merge base 334ee8b.
NEW FAILURES - The following jobs have failed:
FLAKY - The following job failed but was likely due to flakiness present on trunk:
UNSTABLE - The following job failed but was likely due to flakiness present on trunk and has been marked as unstable:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
nit to inline some doc. Sounds good otherwise!
@@ -485,7 +485,8 @@ inline DeprecatedTypeProperties& MPS(ScalarType s) {
}

inline bool hasCUDA() {
-  return globalContext().hasCUDA();
+  return globalContext().hasCUDA() &&
Could you leave the same comment as in the PR description in this file, explaining the compile-time vs. runtime versions?
I left the comments in this file. Is that OK?
Isn't this a BC-breaking change? Or was there no hasCUDA in 2.5?
Hi @malfet, for at::hasCUDA it is not a BC-breaking change. However, for globalContext().hasCUDA() it could be considered a BC-breaking change.
We plan to modify globalContext().hasCUDA() to check for CUDA availability at compile time (based on the build) instead of at runtime. Do you have any concerns about this change?
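For readers following along, here is a minimal, self-contained sketch of the proposed split between the two checks. The helper names (built_with_cuda, cuda_device_available_at_runtime) and the USE_CUDA guard are illustrative stand-ins, not the actual PyTorch internals:

```cpp
#include <iostream>

// Compile-time question: was CUDA support built into this binary?
bool built_with_cuda() {
#ifdef USE_CUDA  // illustrative build flag, set at compile time
  return true;
#else
  return false;
#endif
}

// Runtime question: is a usable CUDA device actually present right now?
// A real probe would query the driver (e.g. a device count); stubbed here.
bool cuda_device_available_at_runtime() {
  return false;
}

// Proposed semantics from the discussion above:
//   globalContext().hasCUDA()  ~ built_with_cuda()   (compile time / registration)
//   at::hasCUDA()              ~ build check AND runtime check
bool hasCUDA() {
  return built_with_cuda() && cuda_device_available_at_runtime();
}

int main() {
  std::cout << "built with CUDA: " << built_with_cuda() << '\n'
            << "hasCUDA():       " << hasCUDA() << '\n';
}
```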
Cherry picking #144368
The cherry pick PR is at #144541 and it is recommended to link a regression cherry pick PR with an issue. The following tracker issues are updated:
Details for Dev Infra team
Raised by workflow job
@pytorchbot revert -m "broke internal tests D68023262, probably the same problem as noted in the issue this PR is mentioned above" -c nosignal
@pytorchbot successfully started a revert job. Check the current status here.
This reverts commit eeb5739. Reverted #144370 on behalf of https://github.com/clee2000 due to broke internal tests D68023262, probably the same problem as noted in the issue this PR is mentioned above ([comment](#144368 (comment)))
…144368)" This reverts commit 2583d83. Reverted #144368 on behalf of https://github.com/clee2000 due to broke internal tests D68023262, probably the same problem as noted in the issue this PR is mentioned above ([comment](#144368 (comment)))
@guangyey your PR has been successfully reverted.
@@ -6,7 +6,7 @@ namespace at::accelerator {

std::optional<c10::DeviceType> getAccelerator(bool checked) {
#define DETECT_AND_ASSIGN_ACCELERATOR(device_name) \
-  if (at::has##device_name()) { \
+  if (at::globalContext().has##device_name()) { \
Soooo, from a quick chat with @egienvalue, it looks like there exists some internal config that has CUDA + MTIA in a single binary.
While we fix that on Meta's side, would you be able to comment out the DETECT_AND_ASSIGN_ACCELERATOR(MTIA) line so we can land this PR without breaking internal builds?
I'll work with @egienvalue on the appropriate fix in a follow-up PR.
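To make the request concrete, here is a rough, self-contained model of the detection pattern in getAccelerator() with the MTIA probe commented out. The DeviceType enum and the stubbed hasCUDA/hasMTIA/hasXPU helpers are simplifications for illustration, not the real ATen code:

```cpp
#include <optional>
#include <stdexcept>

// Simplified stand-ins for the build-time backend checks; stubs only.
enum class DeviceType { CUDA, MTIA, XPU };
bool hasCUDA() { return true;  }
bool hasMTIA() { return false; }
bool hasXPU()  { return false; }

std::optional<DeviceType> getAccelerator(bool checked) {
  std::optional<DeviceType> device_type;

// Each probe assigns the accelerator if its backend was built in. The real
// code also enforces that at most one accelerator is detected, which is why a
// binary built with both CUDA and MTIA trips the internal tests.
#define DETECT_AND_ASSIGN_ACCELERATOR(device_name) \
  if (has##device_name()) {                        \
    device_type = DeviceType::device_name;         \
  }

  DETECT_AND_ASSIGN_ACCELERATOR(CUDA)
  // Temporarily disabled per the review comment above, until the CUDA+MTIA
  // single-binary config is handled on Meta's side:
  // DETECT_AND_ASSIGN_ACCELERATOR(MTIA)
  DETECT_AND_ASSIGN_ACCELERATOR(XPU)
#undef DETECT_AND_ASSIGN_ACCELERATOR

  if (checked && !device_type.has_value()) {
    throw std::runtime_error("no accelerator detected");
  }
  return device_type;
}

int main() {
  auto acc = getAccelerator(/*checked=*/false);
  return acc.has_value() ? 0 : 1;
}
```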
or TEST_PRIVATEUSE1
or TEST_ROCM
or TEST_XPU
or True,
@albanD @egienvalue I have to skip the MTIA UT temporarily as well.
Sorry for the confusion on this one. There are quite a few moving pieces.
In my opinion, it is fine not to change the behavior of
Yes. Some of the unreasonable logic in the accelerator API is being hidden by the runtime check in
Sorry for the delay on this one. Is there any delta between this and the PR at #146098?
#146098 is good enough; I will close this PR.
Stack from ghstack (oldest at bottom):

Motivation
Fixes #144152

Solution
- Use at::globalContext().hasXXX to determine whether accelerator XXX is built with PyTorch or an extension for it has already been registered to PyTorch.
- Use at::hasXXX to determine whether accelerator XXX is available at runtime.
- Use at::globalContext().hasXXX in getAccelerator rather than at::hasXXX, to avoid initializing the XXX runtime (which can poison child processes) while detecting the current accelerator; see the fork-safety sketch below.

cc @EikanWang @jgong5 @wenzhe-nrv @sanchitintel @albanD
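As a rough illustration of the fork-poisoning concern referenced in the Solution above, here is a minimal sketch (POSIX fork, with a stubbed runtime probe; the helper names are hypothetical, not PyTorch APIs) of why a pure build-time check is safe before spawning child processes while a runtime probe is not:

```cpp
#include <unistd.h>    // fork(), getpid()
#include <sys/wait.h>  // waitpid()
#include <iostream>

// Compile-time fact: never touches the driver, so it is safe before fork().
bool built_with_cuda() { return true; }

// Hypothetical runtime probe: a real one (e.g. asking the driver for a device
// count) initializes the CUDA runtime in the calling process. If that happens
// in the parent, a forked child inherits a context it cannot safely reuse --
// the "poisoned child process" from the linked issue.
bool cuda_runtime_available() {
  std::cout << "initializing CUDA runtime in pid " << getpid() << "...\n";
  return true;
}

int main() {
  // Safe pattern (what the PR moves getAccelerator() toward): build-time only.
  bool detected = built_with_cuda();

  // Unsafe pattern (what the PR avoids): probing the runtime before forking.
  // bool detected = cuda_runtime_available();

  pid_t pid = fork();
  if (pid == 0) {
    // Child: the runtime was never initialized in the parent, so the child
    // can bring it up cleanly on first real use.
    std::cout << "child sees detected=" << detected << '\n';
    return 0;
  }
  waitpid(pid, nullptr, 0);
  std::cout << "parent sees detected=" << detected << '\n';
}
```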