
[Don't Merge] Fix poison child process issue when calling getAccelerator() #144368


Closed · wants to merge 16 commits

Conversation

@guangyey (Collaborator) commented Jan 8, 2025

Stack from ghstack (oldest at bottom):

Motivation

fix #144152

Solution

  • Align at::globalContext()::hasXXX to determine whether accelerator XXX is built into PyTorch or has already been registered to PyTorch by an extension.
  • Define at::hasXXX to determine whether accelerator XXX is available at runtime.
  • Use at::globalContext()::hasXXX in getAccelerator rather than at::hasXXX, so that detecting the current accelerator does not initialize the XXX runtime (which can poison child processes); see the sketch below.
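
A minimal usage sketch of the split described above (an illustration of the intended semantics, not code from this PR):

```cpp
#include <ATen/Context.h>

int main() {
  // Build/registration check: was CUDA support compiled into PyTorch,
  // or registered by an extension? Per this PR it must not initialize
  // the CUDA runtime, so it stays safe in a process that later fork()s.
  bool built = at::globalContext().hasCUDA();

  // Runtime check: is a usable CUDA device actually present? This may
  // initialize the CUDA runtime and poison later fork()ed children.
  bool available = at::hasCUDA();

  // getAccelerator() switches to the first form so that merely detecting
  // the accelerator remains fork-safe.
  return (built && available) ? 0 : 1;
}
```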

cc @EikanWang @jgong5 @wenzhe-nrv @sanchitintel @albanD


pytorch-bot bot commented Jan 8, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/144368

Note: Links to docs will display an error until the docs builds have been completed.

❌ 4 New Failures, 2 Unrelated Failures

As of commit 4636c6f with merge base 334ee8b:


This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the oncall: jit label Jan 8, 2025
@guangyey requested a review from albanD January 8, 2025 04:35
@guangyey added the module: accelerator, ciflow/xpu, topic: improvements, topic: bug fixes, release notes: cpp, and topic: not user facing labels, and removed the ciflow/xpu, topic: bug fixes, and release notes: cpp labels Jan 8, 2025
[ghstack-poisoned]
@albanD (Collaborator) left a comment:


nit to inline some doc. Sounds good otherwise!

```diff
@@ -485,7 +485,8 @@ inline DeprecatedTypeProperties& MPS(ScalarType s) {
 }

 inline bool hasCUDA() {
-  return globalContext().hasCUDA();
+  return globalContext().hasCUDA() &&
```
albanD (Collaborator):

Could you leave the same comment as in the PR description in this file, explaining the compile-time vs. runtime versions?

guangyey (Collaborator, Author):

I've left the comments in this file. Is that OK?

malfet (Contributor):

Isn't this a BC-breaking change? Or was there no hasCUDA in 2.5?

guangyey (Collaborator, Author):

Hi @malfet. For at::hasCUDA, it is not a BC-breaking change. However, for globalContext().hasCUDA(), it could be considered a BC-breaking change.
We plan to modify globalContext().hasCUDA() to check for CUDA availability at compile time (based on the build) instead of at runtime. Do you have any concerns about this change?
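
A hedged sketch of what that compile-time flavor could look like (USE_CUDA is PyTorch's usual build flag; the function body and the registration flag below are assumptions, not an agreed design):

```cpp
// Sketch only: a compile-time hasCUDA(), as floated above.
inline bool Context::hasCUDA() {
#ifdef USE_CUDA
  // CUDA support was compiled into this build.
  return true;
#else
  // An out-of-tree backend may still register CUDA-style hooks at load
  // time; a registration flag (hypothetical member) would cover that.
  return cuda_hooks_registered_;
#endif
}
```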

pytorchbot (Collaborator):

Cherry picking #144368

The cherry pick PR is at #144541 and it is recommended to link a regression cherry pick PR with an issue. The following tracker issues are updated:

Details for Dev Infra team: raised by workflow job.

clee2000 (Contributor):

@pytorchbot revert -m "broke internal tests D68023262, probably the same problem as noted in the issue this PR is mentioned above" -c nosignal

pytorchmergebot (Collaborator):

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

pytorchmergebot added a commit that referenced this pull request Jan 10, 2025
This reverts commit eeb5739.

Reverted #144370 on behalf of https://github.com/clee2000 due to broke internal tests D68023262, probably the same problem as noted in the issue this PR is mentioned above ([comment](#144368 (comment)))
pytorchmergebot added a commit that referenced this pull request Jan 10, 2025
…144368)"

This reverts commit 2583d83.

Reverted #144368 on behalf of https://github.com/clee2000 due to broke internal tests D68023262, probably the same problem as noted in the issue this PR is mentioned above ([comment](#144368 (comment)))
pytorchmergebot (Collaborator):

@guangyey your PR has been successfully reverted.

[ghstack-poisoned]
```diff
@@ -6,7 +6,7 @@ namespace at::accelerator {

 std::optional<c10::DeviceType> getAccelerator(bool checked) {
 #define DETECT_AND_ASSIGN_ACCELERATOR(device_name) \
-  if (at::has##device_name()) {                    \
+  if (at::globalContext().has##device_name()) {    \
```
albanD (Collaborator):

Soooo from a quick chat with @egienvalue it looks like there are some internal configs that have CUDA + MTIA in a single binary.
While we fix that on Meta's side, would you be able to comment out the DETECT_AND_ASSIGN_ACCELERATOR(MTIA) line so we can land this PR without breaking internal?
I'll work with @egienvalue on the appropriate fix in a follow-up PR.

guangyey (Collaborator, Author):

I got it. By the way, this fix will introduce an issue, tracked in #144567, which will be fixed in the subsequent PR #144664.
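
For context, here is roughly what the changed macro expands to for one backend (a hand-expansion sketch; the checked/error-handling logic of the real function is omitted):

```cpp
// DETECT_AND_ASSIGN_ACCELERATOR(CUDA) after this PR, expanded by hand.
std::optional<c10::DeviceType> device_type;
if (at::globalContext().hasCUDA()) {    // build/registration check only,
  device_type = c10::DeviceType::CUDA;  // so no CUDA runtime init here
}
```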

[ghstack-poisoned]
```python
or TEST_PRIVATEUSE1
or TEST_ROCM
or TEST_XPU
or True,
```
guangyey (Collaborator, Author):

@albanD @egienvalue Have to skip MTIA UT temporarily as well.

@guangyey guangyey requested a review from albanD January 15, 2025 10:06
[ghstack-poisoned]
@albanD (Collaborator) commented Jan 16, 2025

Sorry for the confusion on this one. There are quite a few moving pieces.

  • In the short term, we can't enforce the compile-time check. And since the check happens at runtime, I think it is best to just bite the bullet and make this a runtime check (preserving the behavior from before this PR).
  • If we want to add compile checks (on top of the runtime ones), then we should make them happen at compilation time (via the USE_* flags), and we will need to make sure we allow cuda+mtia at least (see the sketch after this list).
  • Preserving the current behavior for these accelerator checks means we can keep the current tests.
  • I am still not sure how this PR links to the one above in the stack. Is the other one fixing a problem introduced here as a side effect of changing the semantics?
  • Generally, I have a few concerns about stability here; I am struggling to keep in my head all the side effects of the changes we make. Don't hesitate to write longer descriptions with details on the implications to make sure I'm not missing anything!
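
A sketch of the second bullet's idea, with compile-time guards layered over the runtime checks (an illustration of the shape, not an agreed design; note that it deliberately lets CUDA and MTIA coexist in one binary):

```cpp
#include <ATen/Context.h>
#include <optional>

// Sketch only: USE_CUDA / USE_MTIA are the usual build flags; the
// priority order below is illustrative, not a decided policy.
std::optional<c10::DeviceType> detectAcceleratorSketch() {
  std::optional<c10::DeviceType> result;
#ifdef USE_CUDA
  if (!result && at::hasCUDA()) {   // runtime check preserved
    result = c10::DeviceType::CUDA;
  }
#endif
#ifdef USE_MTIA
  if (!result && at::hasMTIA()) {   // allowed alongside CUDA in one binary
    result = c10::DeviceType::MTIA;
  }
#endif
  return result;
}
```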

[ghstack-poisoned]
@guangyey (Collaborator, Author) commented:

In my opinion, it is fine not to change the behavior of getAccelerator in the short term, but ultimately it keeps a limitation: it initializes the device runtime, possibly causing a poison-fork issue. We should fix this in the long term; currently it is blocked by some MTIA use cases.

> I am still not sure how this PR links to the one above in the stack? Is the other one fixing a problem introduced here as a side effect from changing the semantic?

Yes. Some unreasonable logic in the accelerator API is being hidden by the runtime check in getAccelerator. We should fix it regardless of whether we change the behavior of getAccelerator. I will elaborate on the details in #144664.
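
For reference, the poison-fork hazard mentioned above can be reproduced with raw CUDA runtime calls along these lines (a sketch that assumes a CUDA toolkit and POSIX fork(); the exact error the child reports can vary by driver version):

```cpp
#include <cuda_runtime.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cstdio>

int main() {
  int n = 0;
  // Querying the device count initializes the CUDA runtime in the parent.
  cudaGetDeviceCount(&n);

  if (fork() == 0) {
    // Child: CUDA state inherited across fork() is unusable; runtime calls
    // typically fail with an initialization error.
    cudaError_t err = cudaFree(nullptr);
    std::printf("child cudaFree: %s\n", cudaGetErrorString(err));
    _exit(0);
  }
  wait(nullptr);
  return 0;
}
```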

@atalman atalman added this to the 2.6.1 milestone Jan 21, 2025
@guangyey changed the title from "Fix poison child process issue when calling getAccelerator()" to "[Don't Merge] Fix poison child process issue when calling getAccelerator()" Jan 22, 2025
@albanD (Collaborator) commented Jan 31, 2025

Sorry for the delay on this one. Is there any delta between this and the PR at #146098?

@guangyey (Collaborator, Author) commented Feb 8, 2025

#146098 is good enough, so I will close this PR.

@guangyey guangyey closed this Feb 8, 2025
@github-actions github-actions bot deleted the gh/guangyey/114/head branch March 11, 2025 02:08
Labels: ci-no-td, ciflow/trunk, ciflow/xpu, Merged, module: accelerator, oncall: jit, open source, Reverted, topic: improvements, topic: not user facing
9 participants