
Unknown CUDA graph CaptureStatus21852 #91970

@sheilaliuxl

Description

🐛 Describe the bug

Hi there,

We're hitting an "Unknown CUDA graph CaptureStatus" internal assert with PyTorch 1.13.1. The failure is flaky, but it has shown up twice in our training jobs, so it seems worth looking into and fixing.

Here is the stack trace:

[1,17]<stdout>:[2023-01-06 19:46:19.112: C smdistributed/modelparallel/torch/worker.py:110] [17] Hit an exception for 9008/0 on thread 1: false INTERNAL ASSERT FAILED at "../c10/cuda/CUDAGraphsC10Utils.h":73, please report a bug to PyTorch. Unknown CUDA graph CaptureStatus21852
[1,17]<stdout>:[2023-01-06 19:46:19.114: C smdistributed/modelparallel/torch/worker.py:115] [17]   File "/opt/conda/lib/python3.9/site-packages/smdistributed/modelparallel/torch/worker.py", line 515, in thread_compute
[1,17]<stdout>:    self.thread_execute_backward(req)
[1,17]<stdout>:  File "/opt/conda/lib/python3.9/site-packages/smdistributed/modelparallel/torch/worker.py", line 486, in thread_execute_backward
[1,17]<stdout>:    self._bwd_aggregated_execute(req, mod, parent_mod)
[1,17]<stdout>:  File "/opt/conda/lib/python3.9/site-packages/smdistributed/modelparallel/torch/worker.py", line 415, in _bwd_aggregated_execute
[1,17]<stdout>:    torch.autograd.backward(all_outputs, all_grads, retain_graph=retain)
[1,17]<stdout>:  File "/opt/conda/lib/python3.9/site-packages/torch/autograd/__init__.py", line 197, in backward
[1,17]<stdout>:    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
[1,17]<stdout>:  File "/opt/conda/lib/python3.9/site-packages/torch/autograd/function.py", line 267, in apply
[1,17]<stdout>:    return user_fn(self, *args)
[1,17]<stdout>:  File "/opt/conda/lib/python3.9/site-packages/smdistributed/modelparallel/torch/patches/checkpoint.py", line 228, in backward
[1,17]<stdout>:    outputs = ctx.run_function(*_args, **kwargs)
[1,17]<stdout>:  File "/opt/conda/lib/python3.9/contextlib.py", line 137, in __exit__
[1,17]<stdout>:    self.gen.throw(typ, value, traceback)
[1,17]<stdout>:  File "/opt/conda/lib/python3.9/site-packages/torch/random.py", line 129, in fork_rng
[1,17]<stdout>:    torch.cuda.set_rng_state(gpu_rng_state, device)
[1,17]<stdout>:  File "/opt/conda/lib/python3.9/site-packages/torch/cuda/random.py", line 64, in set_rng_state
[1,17]<stdout>:    lazy_call(cb)
[1,17]<stdout>:  File "/opt/conda/lib/python3.9/site-packages/torch/cuda/__init__.py", line 165, in _lazy_call
[1,17]<stdout>:    callable()
[1,17]<stdout>:  File "/opt/conda/lib/python3.9/site-packages/torch/cuda/random.py", line 62, in cb
[1,17]<stdout>:    default_generator.set_state(new_state_copy)
[1,17]<stdout>:
[1,17]<stdout>:[2023-01-06 19:46:19.114: C smdistributed/modelparallel/torch/worker.py:116] [17] Parent exec stack ['main', 'main/module', 'main/module/module', 'main/module/module/transformer', 'main/module/module/transformer/seq_layers', 'main/module/module/transformer/seq_layers/30', 'main/module/module/transformer/seq_layers/30/attention']
[1,17]<stdout>:[2023-01-06 19:46:19.114: C smdistributed/modelparallel/torch/worker.py:117] [17] Req <ModExecReq::BWD::mb:0, module:main, sender_module: main, requester:0, executor:0, position: -1>
[1,17]<stderr>:[compute-st-worker-26:00059] *** Process received signal ***
[1,17]<stderr>:[compute-st-worker-26:00059] Signal: Segmentation fault (11)
[1,17]<stderr>:[compute-st-worker-26:00059] Signal code: Address not mapped (1)
[1,17]<stderr>:[compute-st-worker-26:00059] Failing at address: (nil)
[1,17]<stderr>:[compute-st-worker-26:00059] [ 0] /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x14420)[0x7f79b11b8420]
[1,17]<stderr>:[compute-st-worker-26:00059] [ 1] /opt/conda/lib/python3.9/site-packages/torch/lib/libc10_cuda.so(+0x2735d)[0x7f78c176b35d]
[1,17]<stderr>:[compute-st-worker-26:00059] [ 2] /opt/conda/lib/python3.9/site-packages/torch/lib/libc10_cuda.so(+0x43183)[0x7f78c1787183]
[1,17]<stderr>:[compute-st-worker-26:00059] [ 3] [1,17]<stderr>:/opt/conda/lib/python3.9/site-packages/torch/lib/libc10_cuda.so(+0x44c38)[0x7f78c1788c38]
[1,17]<stderr>:[compute-st-worker-26:00059] [ 4] /opt/conda/lib/python3.9/site-packages/torch/lib/libc10_cuda.so(+0x44e92)[0x7f78c1788e92]
[1,17]<stderr>:[compute-st-worker-26:00059] [ 5] [1,17]<stderr>:/opt/conda/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so(_ZN2at6detail13empty_genericEN3c108ArrayRefIlEEPNS1_9AllocatorENS1_14DispatchKeySetENS1_10ScalarTypeENS1_8optionalINS1_12MemoryFormatEEE+0xabf)[0x7f78c317e2bf]
[1,17]<stderr>:[compute-st-worker-26:00059] [ 6] [1,17]<stderr>:/opt/conda/lib/python3.9/site-packages/torch/lib/libtorch_cuda_cpp.so(_ZN2at6detail10empty_cudaEN3c108ArrayRefIlEENS1_10ScalarTypeENS1_8optionalINS1_6DeviceEEENS5_INS1_12MemoryFormatEEE+0x111)[0x7f78dca67ac1]
[1,17]<stderr>:[compute-st-worker-26:00059] [ 7] [1,17]<stderr>:/opt/conda/lib/python3.9/site-packages/torch/lib/libtorch_cuda_cpp.so(_ZN2at6detail10empty_cudaEN3c108ArrayRefIlEENS1_8optionalINS1_10ScalarTypeEEENS4_INS1_6LayoutEEENS4_INS1_6DeviceEEENS4_IbEENS4_INS1_12MemoryFormatEEE+0x31)[0x7f78dca67d91]
[1,17]<stderr>:[compute-st-worker-26:00059] [ 8] [1,17]<stderr>:/opt/conda/lib/python3.9/site-packages/torch/lib/libtorch_cuda_cpp.so(_ZN2at6detail10empty_cudaEN3c108ArrayRefIlEERKNS1_13TensorOptionsE+0x10f)[0x7f78dca67eff]
[1,17]<stderr>:[compute-st-worker-26:00059] [ 9] [1,17]<stderr>:/opt/conda/lib/python3.9/site-packages/torch/lib/libtorch_cuda_cu.so(+0x2d1622b)[0x7f789949022b]
[1,17]<stderr>:[compute-st-worker-26:00059] [10] [1,17]<stderr>:/opt/conda/lib/python3.9/site-packages/torch/lib/libtorch_cuda_cu.so(+0x2dda806)[0x7f7899554806]
[1,17]<stderr>:[compute-st-worker-26:00059] [11] [1,17]<stderr>:/opt/conda/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so(_ZN2at4meta14structured_cat4metaERKN3c108IListRefINS_6TensorEEEl+0xc09)[0x7f78c383a339]
[1,17]<stderr>:[compute-st-worker-26:00059] [12] [1,17]<stderr>:/opt/conda/lib/python3.9/site-packages/torch/lib/libtorch_cuda_cu.so(+0x2d23c27)[0x7f789949dc27]
[1,17]<stderr>:[compute-st-worker-26:00059] [13] [1,17]<stderr>:/opt/conda/lib/python3.9/site-packages/torch/lib/libtorch_cuda_cu.so(+0x2d23cd0)[0x7f789949dcd0]
[1,17]<stderr>:[compute-st-worker-26:00059] [14] [1,17]<stderr>:/opt/conda/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so(_ZN2at4_ops3cat10redispatchEN3c1014DispatchKeySetERKNS2_8IListRefINS_6TensorEEEl+0x78)[0x7f78c3c561f8]
[1,17]<stderr>:[compute-st-worker-26:00059] [15] [1,17]<stderr>:/opt/conda/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so(+0x3dc37a1)[0x7f78c56057a1]
[1,17]<stderr>:[compute-st-worker-26:00059] [16] [1,17]<stderr>:/opt/conda/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so(+0x3dc43d3)[0x7f78c56063d3]
[1,17]<stderr>:[compute-st-worker-26:00059] [17] [1,17]<stderr>:/opt/conda/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so(_ZN2at4_ops3cat4callERKN3c108IListRefINS_6TensorEEEl+0x1a9)[0x7f78c3c99929]
[1,17]<stderr>:[compute-st-worker-26:00059] [18] [1,17]<stderr>:/opt/conda/lib/python3.9/site-packages/torch/lib/libtorch_python.so(+0x5c83d3)[0x7f78ec50a3d3]
[1,17]<stderr>:[compute-st-worker-26:00059] [19] [1,17]<stderr>:/opt/conda/bin/python(+0x14e26c)[0x555c0737c26c]
[1,17]<stderr>:[compute-st-worker-26:00059] [20] [1,17]<stderr>:/opt/conda/bin/python(PyObject_Call+0x157)[0x555c0737a487]
[1,17]<stderr>:[compute-st-worker-26:00059] [21] [1,17]<stderr>:/opt/conda/bin/python(_PyEval_EvalFrameDefault+0x5f20)[0x555c0735f860]
[1,17]<stderr>:[compute-st-worker-26:00059] [22] /opt/conda/bin/python(+0x12a967)[0x555c07358967]
[1,17]<stderr>:[compute-st-worker-26:00059] [23] [1,17]<stderr>:/opt/conda/bin/python(_PyFunction_Vectorcall+0xb9)[0x555c0736ad39]
[1,17]<stderr>:[compute-st-worker-26:00059] [24] [1,17]<stderr>:/opt/conda/bin/python(_PyEval_EvalFrameDefault+0x3c3)[0x555c07359d03]
[1,17]<stderr>:[compute-st-worker-26:00059] [25] /opt/conda/bin/python(+0x12a967)[0x555c07358967]
[1,17]<stderr>:[compute-st-worker-26:00059] [26] /opt/conda/bin/python(_PyFunction_Vectorcall+0xb9)[0x555c0736ad39]
[1,17]<stderr>:[compute-st-worker-26:00059] [27] [1,17]<stderr>:/opt/conda/bin/python(PyObject_Call+0xb4)[0x555c0737a3e4]
[1,17]<stderr>:[compute-st-worker-26:00059] [28] [1,17]<stderr>:/opt/conda/bin/python(_PyEval_EvalFrameDefault+0x39fa)[0x555c0735d33a]
[1,17]<stderr>:[compute-st-worker-26:00059] [29] /opt/conda/bin/python(+0x12a967)[0x555c07358967]
[1,17]<stderr>:[compute-st-worker-26:00059] *** End of error message ***
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
mpirun.real noticed that process rank 17 with PID 59 on node compute-st-worker-26 exited on signal 11 (Segmentation fault).
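
For reference, the failing path runs through activation checkpointing's backward pass: the checkpoint recomputation happens under torch.random.fork_rng, which restores the GPU RNG state via torch.cuda.set_rng_state, and that is where the CaptureStatus check in CUDAGraphsC10Utils.h trips. Below is a minimal sketch of the same call path using the upstream torch.utils.checkpoint API (our job actually goes through smdistributed's patched checkpoint); it is only an illustration of the pattern, not a confirmed standalone reproduction of the crash.

```python
# Minimal sketch (our assumption, not a confirmed repro) of the call path the
# trace goes through: checkpointed backward re-runs the forward under
# torch.random.fork_rng, which restores the GPU RNG state via
# torch.cuda.set_rng_state -- the frame where the internal assert fires above.
import torch
from torch.utils.checkpoint import checkpoint

device = torch.device("cuda")
layer = torch.nn.Linear(1024, 1024, device=device)

def run(x):
    # Dropout makes the RNG state matter, so checkpoint saves and restores it.
    return torch.nn.functional.dropout(layer(x), p=0.1, training=True)

x = torch.randn(8, 1024, device=device, requires_grad=True)

# preserve_rng_state=True (the default) is what leads into fork_rng /
# set_rng_state during the recomputation in the backward pass.
out = checkpoint(run, x, preserve_rng_state=True)
out.sum().backward()
```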

Thank you.

Versions

# python collect_env.py
Collecting environment information...
PyTorch version: 1.13.1+cu117
Is debug build: False
CUDA used to build PyTorch: 11.7
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.5 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: Could not collect
CMake version: version 3.24.3
Libc version: glibc-2.31

Python version: 3.9.13 | packaged by conda-forge | (main, May 27 2022, 16:58:50)  [GCC 10.3.0] (64-bit runtime)
Python platform: Linux-5.4.0-1080-aws-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: 11.7.99
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: 
GPU 0: NVIDIA A100-SXM4-40GB
GPU 1: NVIDIA A100-SXM4-40GB
GPU 2: NVIDIA A100-SXM4-40GB
GPU 3: NVIDIA A100-SXM4-40GB
GPU 4: NVIDIA A100-SXM4-40GB
GPU 5: NVIDIA A100-SXM4-40GB
GPU 6: NVIDIA A100-SXM4-40GB
GPU 7: NVIDIA A100-SXM4-40GB

Nvidia driver version: 510.47.03
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.5.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] numpy==1.23.5
[pip3] sagemaker-pytorch-training==2.7.0
[pip3] torch==1.13.1+cu117
[pip3] torchaudio==0.13.1+cu117
[pip3] torchdata==0.5.1
[pip3] torchnet==0.0.4
[pip3] torchvision==0.14.1+cu117
[conda] Could not collect
# pip show conda
Name: conda
Version: 22.11.1
Summary: OS-agnostic, system-level binary package manager.
Home-page: https://github.com/conda/conda
Author: Anaconda, Inc.
Author-email: conda@continuum.io
License: BSD-3-Clause
Location: /opt/conda/lib/python3.9/site-packages
Requires: pluggy, pycosat, requests, ruamel.yaml, tqdm
Required-by: mamba

    Labels

    needs reproduction, triaged
