
[Don't Merge] Fix poison child process issue when calling getAccelerator() #144368


Closed · wants to merge 16 commits

Conversation

@guangyey (Collaborator) commented Jan 8, 2025

Stack from ghstack (oldest at bottom):

Motivation

fix #144152

Solution

  • Align at::globalContext()::hasXXX to determine whether accelerator XXX is built into PyTorch or has already been registered to PyTorch by an extension.
  • Define at::hasXXX to determine whether accelerator XXX is available at runtime.
  • Use at::globalContext()::hasXXX in getAccelerator rather than at::hasXXX, so that detecting the current accelerator does not initialize the XXX runtime (which can poison child processes); see the sketch below.
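
A minimal usage sketch of the split described above (an illustration of the intended semantics, not code from this PR):

```cpp
#include <ATen/Context.h>

int main() {
  // Build/registration check: was CUDA support compiled into PyTorch,
  // or registered by an extension? Per this PR it must not initialize
  // the CUDA runtime, so it stays safe in a process that later fork()s.
  bool built = at::globalContext().hasCUDA();

  // Runtime check: is a usable CUDA device actually present? This may
  // initialize the CUDA runtime and poison later fork()ed children.
  bool available = at::hasCUDA();

  // getAccelerator() switches to the first form so that merely detecting
  // the accelerator remains fork-safe.
  return (built && available) ? 0 : 1;
}
```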

cc @EikanWang @jgong5 @wenzhe-nrv @sanchitintel @albanD


pytorch-bot bot commented Jan 8, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/144368

Note: Links to docs will display an error until the docs builds have been completed.

❌ 4 New Failures, 2 Unrelated Failures

As of commit 4636c6f with merge base 334ee8b:


This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the oncall: jit label Jan 8, 2025
@guangyey requested a review from albanD January 8, 2025 04:35
@guangyey added the module: accelerator, ciflow/xpu, topic: improvements, topic: bug fixes, release notes: cpp, and topic: not user facing labels, and removed the ciflow/xpu, topic: bug fixes, and release notes: cpp labels Jan 8, 2025
[ghstack-poisoned]
@albanD (Collaborator) left a comment:


nit to inline some doc. Sounds good otherwise!

```diff
@@ -485,7 +485,8 @@ inline DeprecatedTypeProperties& MPS(ScalarType s) {
 }

 inline bool hasCUDA() {
-  return globalContext().hasCUDA();
+  return globalContext().hasCUDA() &&
```
albanD (Collaborator):

Could you leave the same comment as in the PR description in this file, explaining the compile-time vs. runtime versions?

guangyey (Collaborator, Author):

I've left the comments in this file. Is that OK?

malfet (Contributor):

Isn't this a BC-breaking change? Or was there no hasCUDA in 2.5?

guangyey (Collaborator, Author):

Hi @malfet. For at::hasCUDA, it is not a BC-breaking change. However, for globalContext().hasCUDA(), it could be considered a BC-breaking change.
We plan to modify globalContext().hasCUDA() to check for CUDA availability at compile time (based on the build) instead of at runtime. Do you have any concerns about this change?
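
A hedged sketch of what that compile-time flavor could look like (USE_CUDA is PyTorch's usual build flag; the function body and the registration flag below are assumptions, not an agreed design):

```cpp
// Sketch only: a compile-time hasCUDA(), as floated above.
inline bool Context::hasCUDA() {
#ifdef USE_CUDA
  // CUDA support was compiled into this build.
  return true;
#else
  // An out-of-tree backend may still register CUDA-style hooks at load
  // time; a registration flag (hypothetical member) would cover that.
  return cuda_hooks_registered_;
#endif
}
```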

pytorchbot (Collaborator):

Cherry picking #144368

The cherry pick PR is at #144541 and it is recommended to link a regression cherry pick PR with an issue. The following tracker issues are updated:

Details for Dev Infra team: raised by workflow job.

clee2000 (Contributor):

@pytorchbot revert -m "broke internal tests D68023262, probably the same problem as noted in the issue this PR is mentioned above" -c nosignal

pytorchmergebot (Collaborator):

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

pytorchmergebot added a commit that referenced this pull request Jan 10, 2025
This reverts commit eeb5739.

Reverted #144370 on behalf of https://github.com/clee2000 due to broke internal tests D68023262, probably the same problem as noted in the issue this PR is mentioned above ([comment](#144368 (comment)))
pytorchmergebot added a commit that referenced this pull request Jan 10, 2025
…144368)"

This reverts commit 2583d83.

Reverted #144368 on behalf of https://github.com/clee2000 due to broke internal tests D68023262, probably the same problem as noted in the issue this PR is mentioned above ([comment](#144368 (comment)))
pytorchmergebot (Collaborator):

@guangyey your PR has been successfully reverted.

[ghstack-poisoned]
```diff
@@ -6,7 +6,7 @@ namespace at::accelerator {

 std::optional<c10::DeviceType> getAccelerator(bool checked) {
 #define DETECT_AND_ASSIGN_ACCELERATOR(device_name) \
-  if (at::has##device_name()) {                    \
+  if (at::globalContext().has##device_name()) {    \
```
albanD (Collaborator):

Soooo from a quick chat with @egienvalue it looks like there are some internal configs that have CUDA + MTIA in a single binary.
While we fix that on Meta's side, would you be able to comment out the DETECT_AND_ASSIGN_ACCELERATOR(MTIA) line so we can land this PR without breaking internal?
I'll work with @egienvalue on the appropriate fix in a follow-up PR.

guangyey (Collaborator, Author):

I got it. By the way, this fix will introduce an issue, tracked in #144567, which will be fixed in the subsequent PR #144664.
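
For context, here is roughly what the changed macro expands to for one backend (a hand-expansion sketch; the checked/error-handling logic of the real function is omitted):

```cpp
// DETECT_AND_ASSIGN_ACCELERATOR(CUDA) after this PR, expanded by hand.
std::optional<c10::DeviceType> device_type;
if (at::globalContext().hasCUDA()) {    // build/registration check only,
  device_type = c10::DeviceType::CUDA;  // so no CUDA runtime init here
}
```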

[ghstack-poisoned]
```python
or TEST_PRIVATEUSE1
or TEST_ROCM
or TEST_XPU
or True,
```
guangyey (Collaborator, Author):

@albanD @egienvalue Have to skip MTIA UT temporarily as well.

@guangyey guangyey requested a review from albanD January 15, 2025 10:06
[ghstack-poisoned]
@albanD (Collaborator) commented Jan 16, 2025

Sorry for the confusion on this one. There are quite a few moving pieces.

  • In the short term, we can't enforce the compile-time check. And since the check happens at runtime, I think it is best to just bite the bullet and make this a runtime check (preserving the behavior from before this PR).
  • If we want to add compile checks (on top of the runtime ones), then we should make them happen at compilation time (via the USE_* flags), and we will need to make sure we allow cuda+mtia at least (see the sketch after this list).
  • Preserving the current behavior for these accelerator checks means we can keep the current tests.
  • I am still not sure how this PR links to the one above in the stack. Is the other one fixing a problem introduced here as a side effect of changing the semantics?
  • Generally, I have a few concerns about stability here; I am struggling to keep in my head all the side effects of the changes we make. Don't hesitate to write longer descriptions with details on the implications to make sure I'm not missing anything!
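
A sketch of the second bullet's idea, with compile-time guards layered over the runtime checks (an illustration of the shape, not an agreed design; note that it deliberately lets CUDA and MTIA coexist in one binary):

```cpp
#include <ATen/Context.h>
#include <optional>

// Sketch only: USE_CUDA / USE_MTIA are the usual build flags; the
// priority order below is illustrative, not a decided policy.
std::optional<c10::DeviceType> detectAcceleratorSketch() {
  std::optional<c10::DeviceType> result;
#ifdef USE_CUDA
  if (!result && at::hasCUDA()) {   // runtime check preserved
    result = c10::DeviceType::CUDA;
  }
#endif
#ifdef USE_MTIA
  if (!result && at::hasMTIA()) {   // allowed alongside CUDA in one binary
    result = c10::DeviceType::MTIA;
  }
#endif
  return result;
}
```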

[ghstack-poisoned]
@guangyey (Collaborator, Author) commented:

In my opinion, it is fine not to change the behavior of getAccelerator in the short term, but ultimately it keeps a limitation: it initializes the device runtime, possibly causing a poison-fork issue. We should fix this in the long term; currently it is blocked by some MTIA use cases.

> I am still not sure how this PR links to the one above in the stack? Is the other one fixing a problem introduced here as a side effect from changing the semantic?

Yes. Some unreasonable logic in the accelerator API is being hidden by the runtime check in getAccelerator. We should fix it regardless of whether we change the behavior of getAccelerator. I will elaborate on the details in #144664.
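
For reference, the poison-fork hazard mentioned above can be reproduced with raw CUDA runtime calls along these lines (a sketch that assumes a CUDA toolkit and POSIX fork(); the exact error the child reports can vary by driver version):

```cpp
#include <cuda_runtime.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cstdio>

int main() {
  int n = 0;
  // Querying the device count initializes the CUDA runtime in the parent.
  cudaGetDeviceCount(&n);

  if (fork() == 0) {
    // Child: CUDA state inherited across fork() is unusable; runtime calls
    // typically fail with an initialization error.
    cudaError_t err = cudaFree(nullptr);
    std::printf("child cudaFree: %s\n", cudaGetErrorString(err));
    _exit(0);
  }
  wait(nullptr);
  return 0;
}
```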

@atalman atalman added this to the 2.6.1 milestone Jan 21, 2025
@guangyey changed the title from "Fix poison child process issue when calling getAccelerator()" to "[Don't Merge] Fix poison child process issue when calling getAccelerator()" Jan 22, 2025
@albanD (Collaborator) commented Jan 31, 2025

Sorry for the delay on this one. Is there any delta between this and the PR at #146098?

@guangyey (Collaborator, Author) commented Feb 8, 2025

#146098 is good enough, so I will close this PR.

@guangyey guangyey closed this Feb 8, 2025
@github-actions github-actions bot deleted the gh/guangyey/114/head branch March 11, 2025 02:08
Labels: ci-no-td, ciflow/trunk, ciflow/xpu, Merged, module: accelerator, oncall: jit, open source, Reverted, topic: improvements, topic: not user facing
9 participants