Fix CUDA error not getting captured by handler #92227
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/92227. Note: links to docs will display an error until the docs builds have been completed.
❌ 2 Failures as of commit 355e925 (new failures).
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@pytorchbot merge
Merge started: your change will be merged once all checks pass (ETA 0-4 hours).
Merge failed. Reason: 2 additional jobs have failed; the first few of them are: trunk, trunk / macos-12-py3-arm64-mps / Run MPS tests.
@r-barnes as a follow-up, can we please add a test (using C++ extensions) to avoid such regressions in the future?
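For reference, a minimal sketch of what such a C++-extension-based regression test could look like. It assumes the extension is built with `torch.utils.cpp_extension.load_inline(..., with_cuda=True)` and driven from a Python test that expects the first call to raise a RuntimeError and the second to succeed; the function names and the particular failing CUDA call are illustrative, not part of this PR:

```cpp
// Hypothetical regression-test extension (illustrative; not part of this PR).
#include <torch/extension.h>
#include <c10/cuda/CUDAException.h>
#include <cuda_runtime.h>

// Trigger a CUDA runtime error and check that C10_CUDA_CHECK surfaces it:
// cudaSetDevice with an out-of-range ordinal fails, and the macro is expected
// to turn that failure into a thrown c10::Error (RuntimeError in Python).
void raise_cuda_error() {
  C10_CUDA_CHECK(cudaSetDevice(1 << 20));
}

// After the failure above has been reported and handled, a later check on a
// successful call must not be poisoned by a lingering "last error".
void check_no_lingering_error() {
  C10_CUDA_CHECK(cudaDeviceSynchronize());
}

PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
  m.def("raise_cuda_error", &raise_cuda_error);
  m.def("check_no_lingering_error", &check_no_lingering_error);
}
```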
@pytorchbot merge -f "MacOS failures are clearly unrelated to CUDA-specific change"
Merge started: your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes).
Fix C10_CUDA_CHECK for failing to capture last cuda error occasionally (#93192)

This error was accidentally introduced by #92227, which was trying to fix #91758 as introduced in #85256. The unit test `TestCuda.test_events_multi_gpu_elapsed_time` has been failing since that PR was merged (on CUDA 11.8 and CUDA 12.0). The test requires >= 2 GPUs, so it is probably not exercised in the OSS CI:

```
python test/test_cuda.py -v -k TestCuda.test_events_multi_gpu_elapsed_time
```

For example, in https://github.com/pytorch/pytorch/actions/runs/4026926691/jobs/6922406192:

```
2023-01-27T19:41:32.2312162Z test_events_multi_gpu_elapsed_time (__main__.TestCuda) ... skip: detected only one GPU (0.001s)
```

The original C10_CUDA_CHECK before #85256 had an extra `cudaGetLastError` that captures those CUDA errors: https://github.com/pytorch/pytorch/pull/85256/files#diff-0823e63e781acf56e93a5553ed7feee0db0bda05d86e2560c7b80e87e32e0024L41-L42

This extra `cudaGetLastError` was originally introduced in #17337. As commented there (https://github.com/pytorch/pytorch/pull/17337/files#r259104503):

> soumith on Feb 21, 2019: Without this, a previously raised error was still lingering and falsely being triggered for a subsequent CUDA call. colesbury suggested that this is the right thing to do.

Pull Request resolved: #93192
Approved by: https://github.com/ezyang
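To make the mechanism concrete, here is a minimal sketch of the capture-and-clear pattern described above (the macro name and message format are assumptions, not the actual C10_CUDA_CHECK implementation): once a failure has been observed and reported, the runtime's per-thread "last error" slot has to be drained so it cannot be misattributed to the next, unrelated call.

```cpp
// Sketch of a CUDA error check that clears the lingering "last error"
// (illustrative names; not the actual c10 macro).
#include <cuda_runtime.h>
#include <stdexcept>
#include <string>

#define CHECK_CUDA_SKETCH(expr)                                            \
  do {                                                                     \
    const cudaError_t err__ = (expr);                                      \
    if (err__ != cudaSuccess) {                                            \
      /* Drain the lingering "last error" (cudaGetLastError resets it) */  \
      /* so a later check on a successful call is not falsely triggered */ \
      /* by this failure.                                               */ \
      (void)cudaGetLastError();                                            \
      throw std::runtime_error(std::string("CUDA error: ") +               \
                               cudaGetErrorString(err__));                 \
    }                                                                      \
  } while (0)

// Usage: CHECK_CUDA_SKETCH(cudaDeviceSynchronize());
```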
Fixes #91758. This still leaves function calls on the hot path.
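Regarding the "function calls on the hot path" note, a plausible shape (hypothetical names, not the code added by this PR) is a macro that evaluates the expression inline but forwards the result to an out-of-line handler, so even the success path pays a function call:

```cpp
// Hypothetical shape of a check macro delegating to an out-of-line handler.
// Names (sketch::check_impl, SKETCH_CUDA_CHECK) are assumptions, not PyTorch code.
#include <cuda_runtime.h>
#include <cstdint>
#include <stdexcept>
#include <string>

namespace sketch {
// Out-of-line handler: keeps call sites small, but means every check -
// including the success path - goes through a function call.
void check_impl(int32_t err, const char* file, int line) {
  if (err != static_cast<int32_t>(cudaSuccess)) {
    (void)cudaGetLastError(); // clear the lingering error, as discussed above
    throw std::runtime_error(std::string("CUDA error at ") + file + ":" +
                             std::to_string(line) + ": " +
                             cudaGetErrorString(static_cast<cudaError_t>(err)));
  }
}
} // namespace sketch

#define SKETCH_CUDA_CHECK(expr) \
  sketch::check_impl(static_cast<int32_t>(expr), __FILE__, __LINE__)
```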