🐛 Describe the bug
Hi there,
We're getting an unknown CUDA graph error with PyTorch 1.13.1. Although it is flaky, it has shown up twice, so it seems worth looking into and fixing.
Here is the stack trace (a rough sketch of the code path it implicates follows after the trace):
[1,17]<stdout>:[2023-01-06 19:46:19.112: C smdistributed/modelparallel/torch/worker.py:110] [17] Hit an exception for 9008/0 on thread 1: false INTERNAL ASSERT FAILED at "../c10/cuda/CUDAGraphsC10Utils.h":73, please report a bug to PyTorch. Unknown CUDA graph CaptureStatus21852
[1,17]<stdout>:[2023-01-06 19:46:19.114: C smdistributed/modelparallel/torch/worker.py:115] [17] File "/opt/conda/lib/python3.9/site-packages/smdistributed/modelparallel/torch/worker.py", line 515, in thread_compute
[1,17]<stdout>: self.thread_execute_backward(req)
[1,17]<stdout>: File "/opt/conda/lib/python3.9/site-packages/smdistributed/modelparallel/torch/worker.py", line 486, in thread_execute_backward
[1,17]<stdout>: self._bwd_aggregated_execute(req, mod, parent_mod)
[1,17]<stdout>: File "/opt/conda/lib/python3.9/site-packages/smdistributed/modelparallel/torch/worker.py", line 415, in _bwd_aggregated_execute
[1,17]<stdout>: torch.autograd.backward(all_outputs, all_grads, retain_graph=retain)
[1,17]<stdout>: File "/opt/conda/lib/python3.9/site-packages/torch/autograd/__init__.py", line 197, in backward
[1,17]<stdout>: Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
[1,17]<stdout>: File "/opt/conda/lib/python3.9/site-packages/torch/autograd/function.py", line 267, in apply
[1,17]<stdout>: return user_fn(self, *args)
[1,17]<stdout>: File "/opt/conda/lib/python3.9/site-packages/smdistributed/modelparallel/torch/patches/checkpoint.py", line 228, in backward
[1,17]<stdout>: outputs = ctx.run_function(*_args, **kwargs)
[1,17]<stdout>: File "/opt/conda/lib/python3.9/contextlib.py", line 137, in __exit__
[1,17]<stdout>: self.gen.throw(typ, value, traceback)
[1,17]<stdout>: File "/opt/conda/lib/python3.9/site-packages/torch/random.py", line 129, in fork_rng
[1,17]<stdout>: torch.cuda.set_rng_state(gpu_rng_state, device)
[1,17]<stdout>: File "/opt/conda/lib/python3.9/site-packages/torch/cuda/random.py", line 64, in set_rng_state
[1,17]<stdout>: lazy_call(cb)
[1,17]<stdout>: File "/opt/conda/lib/python3.9/site-packages/torch/cuda/__init__.py", line 165, in _lazy_call
[1,17]<stdout>: callable()
[1,17]<stdout>: File "/opt/conda/lib/python3.9/site-packages/torch/cuda/random.py", line 62, in cb
[1,17]<stdout>: default_generator.set_state(new_state_copy)
[1,17]<stdout>:
[1,17]<stdout>:[2023-01-06 19:46:19.114: C smdistributed/modelparallel/torch/worker.py:116] [17] Parent exec stack ['main', 'main/module', 'main/module/module', 'main/module/module/transformer', 'main/module/module/transformer/seq_layers', 'main/module/module/transformer/seq_layers/30', 'main/module/module/transformer/seq_layers/30/attention']
[1,17]<stdout>:[2023-01-06 19:46:19.114: C smdistributed/modelparallel/torch/worker.py:117] [17] Req <ModExecReq::BWD::mb:0, module:main, sender_module: main, requester:0, executor:0, position: -1>
[1,17]<stderr>:[compute-st-worker-26:00059] *** Process received signal ***
[1,17]<stderr>:[compute-st-worker-26:00059] Signal: Segmentation fault (11)
[1,17]<stderr>:[compute-st-worker-26:00059] Signal code: Address not mapped (1)
[1,17]<stderr>:[compute-st-worker-26:00059] Failing at address: (nil)
[1,17]<stderr>:[compute-st-worker-26:00059] [ 0] /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x14420)[0x7f79b11b8420]
[1,17]<stderr>:[compute-st-worker-26:00059] [ 1] /opt/conda/lib/python3.9/site-packages/torch/lib/libc10_cuda.so(+0x2735d)[0x7f78c176b35d]
[1,17]<stderr>:[compute-st-worker-26:00059] [ 2] /opt/conda/lib/python3.9/site-packages/torch/lib/libc10_cuda.so(+0x43183)[0x7f78c1787183]
[1,17]<stderr>:[compute-st-worker-26:00059] [ 3] [1,17]<stderr>:/opt/conda/lib/python3.9/site-packages/torch/lib/libc10_cuda.so(+0x44c38)[0x7f78c1788c38]
[1,17]<stderr>:[compute-st-worker-26:00059] [ 4] /opt/conda/lib/python3.9/site-packages/torch/lib/libc10_cuda.so(+0x44e92)[0x7f78c1788e92]
[1,17]<stderr>:[compute-st-worker-26:00059] [ 5] [1,17]<stderr>:/opt/conda/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so(_ZN2at6detail13empty_genericEN3c108ArrayRefIlEEPNS1_9AllocatorENS1_14DispatchKeySetENS1_10ScalarTypeENS1_8optionalINS1_12MemoryFormatEEE+0xabf)[0x7f78c317e2bf]
[1,17]<stderr>:[compute-st-worker-26:00059] [ 6] [1,17]<stderr>:/opt/conda/lib/python3.9/site-packages/torch/lib/libtorch_cuda_cpp.so(_ZN2at6detail10empty_cudaEN3c108ArrayRefIlEENS1_10ScalarTypeENS1_8optionalINS1_6DeviceEEENS5_INS1_12MemoryFormatEEE+0x111)[0x7f78dca67ac1]
[1,17]<stderr>:[compute-st-worker-26:00059] [ 7] [1,17]<stderr>:/opt/conda/lib/python3.9/site-packages/torch/lib/libtorch_cuda_cpp.so(_ZN2at6detail10empty_cudaEN3c108ArrayRefIlEENS1_8optionalINS1_10ScalarTypeEEENS4_INS1_6LayoutEEENS4_INS1_6DeviceEEENS4_IbEENS4_INS1_12MemoryFormatEEE+0x31)[0x7f78dca67d91]
[1,17]<stderr>:[compute-st-worker-26:00059] [ 8] [1,17]<stderr>:/opt/conda/lib/python3.9/site-packages/torch/lib/libtorch_cuda_cpp.so(_ZN2at6detail10empty_cudaEN3c108ArrayRefIlEERKNS1_13TensorOptionsE+0x10f)[0x7f78dca67eff]
[1,17]<stderr>:[compute-st-worker-26:00059] [ 9] [1,17]<stderr>:/opt/conda/lib/python3.9/site-packages/torch/lib/libtorch_cuda_cu.so(+0x2d1622b)[0x7f789949022b]
[1,17]<stderr>:[compute-st-worker-26:00059] [10] [1,17]<stderr>:/opt/conda/lib/python3.9/site-packages/torch/lib/libtorch_cuda_cu.so(+0x2dda806)[0x7f7899554806]
[1,17]<stderr>:[compute-st-worker-26:00059] [11] [1,17]<stderr>:/opt/conda/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so(_ZN2at4meta14structured_cat4metaERKN3c108IListRefINS_6TensorEEEl+0xc09)[0x7f78c383a339]
[1,17]<stderr>:[compute-st-worker-26:00059] [12] [1,17]<stderr>:/opt/conda/lib/python3.9/site-packages/torch/lib/libtorch_cuda_cu.so(+0x2d23c27)[0x7f789949dc27]
[1,17]<stderr>:[compute-st-worker-26:00059] [13] [1,17]<stderr>:/opt/conda/lib/python3.9/site-packages/torch/lib/libtorch_cuda_cu.so(+0x2d23cd0)[0x7f789949dcd0]
[1,17]<stderr>:[compute-st-worker-26:00059] [14] [1,17]<stderr>:/opt/conda/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so(_ZN2at4_ops3cat10redispatchEN3c1014DispatchKeySetERKNS2_8IListRefINS_6TensorEEEl+0x78)[0x7f78c3c561f8]
[1,17]<stderr>:[compute-st-worker-26:00059] [15] [1,17]<stderr>:/opt/conda/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so(+0x3dc37a1)[0x7f78c56057a1]
[1,17]<stderr>:[compute-st-worker-26:00059] [16] [1,17]<stderr>:/opt/conda/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so(+0x3dc43d3)[0x7f78c56063d3]
[1,17]<stderr>:[compute-st-worker-26:00059] [17] [1,17]<stderr>:/opt/conda/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so(_ZN2at4_ops3cat4callERKN3c108IListRefINS_6TensorEEEl+0x1a9)[0x7f78c3c99929]
[1,17]<stderr>:[compute-st-worker-26:00059] [18] [1,17]<stderr>:/opt/conda/lib/python3.9/site-packages/torch/lib/libtorch_python.so(+0x5c83d3)[0x7f78ec50a3d3]
[1,17]<stderr>:[compute-st-worker-26:00059] [19] [1,17]<stderr>:/opt/conda/bin/python(+0x14e26c)[0x555c0737c26c]
[1,17]<stderr>:[compute-st-worker-26:00059] [20] [1,17]<stderr>:/opt/conda/bin/python(PyObject_Call+0x157)[0x555c0737a487]
[1,17]<stderr>:[compute-st-worker-26:00059] [21] [1,17]<stderr>:/opt/conda/bin/python(_PyEval_EvalFrameDefault+0x5f20)[0x555c0735f860]
[1,17]<stderr>:[compute-st-worker-26:00059] [22] /opt/conda/bin/python(+0x12a967)[0x555c07358967]
[1,17]<stderr>:[compute-st-worker-26:00059] [23] [1,17]<stderr>:/opt/conda/bin/python(_PyFunction_Vectorcall+0xb9)[0x555c0736ad39]
[1,17]<stderr>:[compute-st-worker-26:00059] [24] [1,17]<stderr>:/opt/conda/bin/python(_PyEval_EvalFrameDefault+0x3c3)[0x555c07359d03]
[1,17]<stderr>:[compute-st-worker-26:00059] [25] /opt/conda/bin/python(+0x12a967)[0x555c07358967]
[1,17]<stderr>:[compute-st-worker-26:00059] [26] /opt/conda/bin/python(_PyFunction_Vectorcall+0xb9)[0x555c0736ad39]
[1,17]<stderr>:[compute-st-worker-26:00059] [27] [1,17]<stderr>:/opt/conda/bin/python(PyObject_Call+0xb4)[0x555c0737a3e4]
[1,17]<stderr>:[compute-st-worker-26:00059] [28] [1,17]<stderr>:/opt/conda/bin/python(_PyEval_EvalFrameDefault+0x39fa)[0x555c0735d33a]
[1,17]<stderr>:[compute-st-worker-26:00059] [29] /opt/conda/bin/python(+0x12a967)[0x555c07358967]
[1,17]<stderr>:[compute-st-worker-26:00059] *** End of error message ***
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
mpirun.real noticed that process rank 17 with PID 59 on node compute-st-worker-26 exited on signal 11 (Segmentation fault).
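For context, here is a rough sketch of the code path the traceback implicates. This is only an illustration we put together against the stock torch.utils.checkpoint API (our actual run goes through smdistributed's checkpoint patch), and the layer shapes and names are placeholders, so it is not a confirmed standalone repro:

import torch
from torch.utils.checkpoint import checkpoint

# Placeholder module standing in for the checkpointed attention block.
layer = torch.nn.Linear(1024, 1024).cuda()

def attention_block(x):
    # Dropout makes the RNG state matter, so checkpointing saves/restores it.
    return torch.nn.functional.dropout(layer(x), p=0.1)

x = torch.randn(8, 1024, device="cuda", requires_grad=True)

# preserve_rng_state=True (the default) makes the backward re-run happen
# inside torch.random.fork_rng, which calls torch.cuda.set_rng_state on exit.
out = checkpoint(attention_block, x, preserve_rng_state=True)

# During this backward the checkpointed region is re-executed; in our job the
# RNG-state restore is where the "Unknown CUDA graph CaptureStatus" assert in
# c10/cuda/CUDAGraphsC10Utils.h fires, followed by the segfault above.
out.sum().backward()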
Thank you.
Versions
# python collect_env.py
Collecting environment information...
PyTorch version: 1.13.1+cu117
Is debug build: False
CUDA used to build PyTorch: 11.7
ROCM used to build PyTorch: N/A
OS: Ubuntu 20.04.5 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: Could not collect
CMake version: version 3.24.3
Libc version: glibc-2.31
Python version: 3.9.13 | packaged by conda-forge | (main, May 27 2022, 16:58:50) [GCC 10.3.0] (64-bit runtime)
Python platform: Linux-5.4.0-1080-aws-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: 11.7.99
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA A100-SXM4-40GB
GPU 1: NVIDIA A100-SXM4-40GB
GPU 2: NVIDIA A100-SXM4-40GB
GPU 3: NVIDIA A100-SXM4-40GB
GPU 4: NVIDIA A100-SXM4-40GB
GPU 5: NVIDIA A100-SXM4-40GB
GPU 6: NVIDIA A100-SXM4-40GB
GPU 7: NVIDIA A100-SXM4-40GB
Nvidia driver version: 510.47.03
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.5.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
Versions of relevant libraries:
[pip3] numpy==1.23.5
[pip3] sagemaker-pytorch-training==2.7.0
[pip3] torch==1.13.1+cu117
[pip3] torchaudio==0.13.1+cu117
[pip3] torchdata==0.5.1
[pip3] torchnet==0.0.4
[pip3] torchvision==0.14.1+cu117
[conda] Could not collect
# pip show conda
Name: conda
Version: 22.11.1
Summary: OS-agnostic, system-level binary package manager.
Home-page: https://github.com/conda/conda
Author: Anaconda, Inc.
Author-email: conda@continuum.io
License: BSD-3-Clause
Location: /opt/conda/lib/python3.9/site-packages
Requires: pluggy, pycosat, requests, ruamel.yaml, tqdm
Required-by: mamba