Don't cache device_count if we haven't initialized CUDA yet #122815

Closed
wants to merge 3 commits into from

Conversation

ezyang
Contributor

@ezyang ezyang commented Mar 27, 2024

Stack from ghstack (oldest at bottom):

Before CUDA is initialized, the device count can still change if CUDA_VISIBLE_DEVICES is modified, so it must not be cached until initialization has happened.
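
To illustrate the problem, here is a minimal, torch-free sketch of what goes wrong when a count is cached unconditionally. Everything in it is hypothetical: `query_visible_devices`, `naive_device_count`, and the four-device machine are stand-ins for illustration, not PyTorch APIs, and the parsing of CUDA_VISIBLE_DEVICES is deliberately simplified.

```python
import os

PHYSICAL_DEVICES = 4  # assumed size of the simulated machine


def query_visible_devices() -> int:
    """Hypothetical stand-in for a driver query (not a PyTorch API).

    Simplified: counts comma-separated entries in CUDA_VISIBLE_DEVICES and
    ignores UUIDs, invalid indices, and duplicates.
    """
    value = os.environ.get("CUDA_VISIBLE_DEVICES")
    if value is None:
        return PHYSICAL_DEVICES
    entries = [tok for tok in value.split(",") if tok.strip()]
    return min(len(entries), PHYSICAL_DEVICES)


_naive_cache = None


def naive_device_count() -> int:
    """Caches on the first call, even though nothing is initialized yet."""
    global _naive_cache
    if _naive_cache is None:
        _naive_cache = query_visible_devices()
    return _naive_cache


os.environ.pop("CUDA_VISIBLE_DEVICES", None)
print(naive_device_count())      # first call populates the cache
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"
print(naive_device_count())      # stale: still the count from before the change
print(query_visible_devices())   # a fresh query sees the new setting
```

In this simulation the second call keeps reporting the pre-change count, which is the kind of stale answer the linked issues describe.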

Fixes #122085
Fixes #38616
Fixes #110000
Fixes #110971
Fixes #95073

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

[ghstack-poisoned]

pytorch-bot bot commented Mar 27, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/122815

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit a2889f2 with merge base 852111e:

BROKEN TRUNK - The following job failed but was also failing on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

ezyang added a commit that referenced this pull request Mar 27, 2024
Before initializing CUDA, it can change by modifying
CUDA_VISIBLE_DEVICES

Fixes #122085
Fixes #38616
Fixes #110000
Fixes #110971

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

ghstack-source-id: 983dcc60d00eb816a8b1e42acec9e6a6b7478510
Pull Request resolved: #122815
Collaborator

@albanD albanD left a comment

Nice!

# NB: Do not cache the device count prior to CUDA initialization, because
# the number of devices can change due to changes to CUDA_VISIBLE_DEVICES
# setting prior to CUDA initialization.
if _cached_device_count is None and _initialized:

Collaborator

nit: _cached_device_count can only be None if we're here
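
A minimal sketch of the caching logic under review, with the nit applied as one possible resolution: once the early return for a populated cache has been passed, only the initialization flag needs checking. The names `query_device_count` and `fake_init`, and the simulated four-device machine, are assumptions made so the example runs without torch or a GPU; the real function queries the CUDA runtime.

```python
import os

PHYSICAL_DEVICES = 4  # assumed size of the simulated machine

_initialized = False          # becomes True once the simulated runtime is initialized
_cached_device_count = None   # populated only after initialization


def query_device_count() -> int:
    """Hypothetical stand-in for the runtime query (not a PyTorch API)."""
    value = os.environ.get("CUDA_VISIBLE_DEVICES")
    if value is None:
        return PHYSICAL_DEVICES
    entries = [tok for tok in value.split(",") if tok.strip()]
    return min(len(entries), PHYSICAL_DEVICES)


def fake_init() -> None:
    """Hypothetical stand-in for CUDA initialization."""
    global _initialized
    _initialized = True


def device_count() -> int:
    global _cached_device_count
    if _cached_device_count is not None:
        return _cached_device_count
    count = query_device_count()
    # Reaching this point implies the cache is still None, so checking
    # _initialized alone is enough before storing the value.
    if _initialized:
        _cached_device_count = count
    return count


os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2"
print(device_count())   # not cached: reflects the current setting
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
print(device_count())   # still not cached: follows the change
fake_init()
print(device_count())   # cached from here on
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"
print(device_count())   # later changes no longer affect the result
```

The design choice is that the cache only becomes sticky once later changes to the environment variable can no longer take effect, so callers before initialization always get a fresh answer.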

@albanD albanD added the "release notes: cuda" and "topic: not user facing" labels Mar 27, 2024
[ghstack-poisoned]
ezyang added a commit that referenced this pull request Mar 27, 2024
Before initializing CUDA, it can change by modifying
CUDA_VISIBLE_DEVICES

Fixes #122085
Fixes #38616
Fixes #110000
Fixes #110971

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

ghstack-source-id: 1f87ece6f87b662ec5d0e110efe0ebf482852c18
Pull Request resolved: #122815
@ezyang
Contributor Author

ezyang commented Mar 27, 2024

cc @wyli

@wyli
Contributor

wyli commented Mar 27, 2024

Cool, this potentially fixes #95073

@ezyang
Contributor Author

ezyang commented Mar 27, 2024

Thanks, I agree.

@ezyang
Contributor Author

ezyang commented Mar 27, 2024

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the "ciflow/trunk" label Mar 27, 2024
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 mandatory check(s) failed. The first few are:

Dig deeper by viewing the failures on hud

Details for Dev Infra team Raised by workflow job

Failing merge rule: Core Maintainers

[ghstack-poisoned]
ezyang added a commit that referenced this pull request Mar 28, 2024
Before initializing CUDA, it can change by modifying
CUDA_VISIBLE_DEVICES

Fixes #122085
Fixes #38616
Fixes #110000
Fixes #110971

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

ghstack-source-id: 437fbaf80973e1ed24239b2ff6397c24b35126c8
Pull Request resolved: #122815
@ezyang
Contributor Author

ezyang commented Mar 28, 2024

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started


@pytorchmergebot
Collaborator

Merge failed

Reason: 2 jobs have failed, first few of them are: .github/workflows/trunk.yml / linux-focal-rocm6.0-py3.8 / build, .github/workflows/trunk.yml / macos-12-py3-arm64 / test (default, 1, 3, macos-m1-stable)

Details for Dev Infra team Raised by workflow job

@ezyang
Contributor Author

ezyang commented Mar 28, 2024

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started


sanketpurandare pushed a commit to sanketpurandare/pytorch that referenced this pull request Apr 22, 2024
…122815)

Before initializing CUDA, it can change by modifying CUDA_VISIBLE_DEVICES

Fixes pytorch#122085
Fixes pytorch#38616
Fixes pytorch#110000
Fixes pytorch#110971
Fixes pytorch#95073

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: pytorch#122815
Approved by: https://github.com/albanD
@github-actions github-actions bot deleted the gh/ezyang/2644/head branch April 28, 2024 01:54
Labels
ciflow/trunk, Merged, release notes: cuda, topic: not user facing

4 participants