
Conversation

@r-barnes (Contributor) commented Jan 15, 2023

Fixes #91758. Still leaves functions on the hotpath.
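For context, a minimal standalone sketch (assuming a CUDA toolchain; this is not C10's actual macro) of the behavior the handler must capture: a failed kernel launch surfaces only through `cudaGetLastError()`, so a check that never inspects that value is effectively a no-op, which is the symptom reported in #91758.

```
#include <cstdio>
#include <cuda_runtime.h>

__global__ void noop() {}

int main() {
  // Provoke a launch error with an invalid configuration (0 threads per block).
  noop<<<1, 0>>>();
  // A failed launch is only observable through cudaGetLastError(); a check
  // that never inspects this value silently swallows the failure.
  cudaError_t err = cudaGetLastError();
  if (err != cudaSuccess) {
    std::printf("kernel launch failed: %s\n", cudaGetErrorString(err));
  }
  return 0;
}
```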

@pytorch-bot bot commented Jan 15, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/92227

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 Failures

As of commit 355e925:

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@r-barnes changed the title from "Fix CUDA error not getting captured by handler" to "[NOLAND] Fix CUDA error not getting captured by handler" on Jan 15, 2023
@r-barnes force-pushed the richard/cuda_error_fix branch from 9cba23f to 5c9e5a6 on Jan 15, 2023 at 20:09
@r-barnes changed the title from "[NOLAND] Fix CUDA error not getting captured by handler" to "Fix CUDA error not getting captured by handler" on Jan 15, 2023
@r-barnes force-pushed the richard/cuda_error_fix branch from 5c9e5a6 to bde810e on Jan 15, 2023 at 20:11
@r-barnes requested review from malfet and ngimel on Jan 15, 2023 at 20:13
@r-barnes force-pushed the richard/cuda_error_fix branch from 74a8dae to ee88aa9 on Jan 16, 2023 at 16:46
@r-barnes (Contributor, Author) commented:

@pytorchbot merge

@pytorch-bot bot added the ciflow/trunk label on Jan 16, 2023
@pytorchmergebot (Collaborator) commented:
Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).


@pytorchmergebot (Collaborator) commented:
Merge failed

Reason: 2 additional jobs have failed; the first few of them are: trunk, trunk / macos-12-py3-arm64-mps / Run MPS tests


@malfet added the topic: not user facing label on Jan 17, 2023
@malfet (Contributor) left a comment:
@r-barnes, as a follow-up, can we please add a test (using C++ extensions) to avoid these regressions in the future?
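A standalone sketch of what such a regression test might assert (hypothetical; the test malfet requests would use PyTorch's C++ extension machinery, which is not shown here): provoke a CUDA error, confirm it is observed, and confirm the sticky error state is cleared afterwards.

```
#include <cassert>
#include <cuda_runtime.h>

int main() {
  // An out-of-range device ordinal makes this runtime call fail.
  cudaError_t err = cudaSetDevice(9999);
  assert(err != cudaSuccess);

  // Clearing the error (as the fixed check should) ensures it is not
  // falsely attributed to the next, unrelated CUDA call.
  (void)cudaGetLastError();
  assert(cudaGetLastError() == cudaSuccess);
  return 0;
}
```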

@malfet (Contributor) commented Jan 17, 2023

@pytorchbot merge -f "MacOS failures are clearly unrelated to CUDA-specific change"

@pytorchmergebot (Collaborator) commented:
Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes).


pytorchmergebot pushed a commit that referenced this pull request on Jan 28, 2023:

Fix C10_CUDA_CHECK for failing to capture last cuda error occasionally (#93192)

This error was accidentally introduced by #92227, which was trying to fix #91758 (itself introduced in #85256).

The unit test `TestCuda.test_events_multi_gpu_elapsed_time` has been failing since that PR was merged (on CUDA 11.8 and CUDA 12.0). That test requires >= 2 GPUs, so it's probably not exercised in the OSS CI?
```
python test/test_cuda.py -v -k TestCuda.test_events_multi_gpu_elapsed_time
```

E.g. in https://github.com/pytorch/pytorch/actions/runs/4026926691/jobs/6922406192
```
2023-01-27T19:41:32.2312162Z   test_events_multi_gpu_elapsed_time (__main__.TestCuda) ... skip: detected only one GPU (0.001s)
```

The original C10_CUDA_CHECK, before #85256, had an extra `cudaGetLastError` call that captured those CUDA errors: https://github.com/pytorch/pytorch/pull/85256/files#diff-0823e63e781acf56e93a5553ed7feee0db0bda05d86e2560c7b80e87e32e0024L41-L42

This extra `cudaGetLastError` was originally introduced in #17337, as commented here: https://github.com/pytorch/pytorch/pull/17337/files#r259104503

> soumith on Feb 21, 2019:
> Without this, a previously raised error was still lingering and falsely being triggered for a subsequent CUDA call. colesbury suggested that this is the right thing to do.
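Concretely, a simplified sketch of that pre-#85256 shape (illustrative only, not the verbatim macro; `CHECK_CUDA` is a stand-in name): on the failure path, the extra `cudaGetLastError()` consumes the lingering error so it cannot leak into a later check.

```
#include <cuda_runtime.h>
#include <stdexcept>
#include <string>

// Simplified sketch of the original check's shape: the extra
// cudaGetLastError() resets the runtime's per-thread error state before
// the error is reported, so a later, unrelated call is not blamed for it.
#define CHECK_CUDA(EXPR)                                            \
  do {                                                              \
    const cudaError_t check_cuda_err = (EXPR);                      \
    if (check_cuda_err != cudaSuccess) {                            \
      (void)cudaGetLastError(); /* consume the lingering error */   \
      throw std::runtime_error(std::string("CUDA error: ") +        \
                               cudaGetErrorString(check_cuda_err)); \
    }                                                               \
  } while (0)
```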
Pull Request resolved: #93192
Approved by: https://github.com/ezyang
Labels: ciflow/trunk, Merged, topic: not user facing
Development

Successfully merging this pull request may close these issues:

C10_CUDA_KERNEL_LAUNCH_CHECK() becomes a no-op