RuntimeError: CUDA error: an illegal memory access was encountered with channels_last #37449

Description

@tstandley

I get an illegal memory access error when trying to train mnasnet (any variant) with apex AMP (opt level O1) and the channels_last memory format.
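For reference, a minimal single-GPU sketch of the same configuration (no torch.distributed launch, no data loading); the batch size, learning rate, and random tensors below are placeholders rather than the values from the full command in the repro section:

import torch
import torchvision
from apex import amp

model = torchvision.models.mnasnet1_3().cuda()
# Convert parameters/buffers to channels_last, as --channels-last=True does in main_amp.py
model = model.to(memory_format=torch.channels_last)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")
criterion = torch.nn.CrossEntropyLoss().cuda()

# Placeholder batch; the real run uses the ImageNet data loader
images = torch.randn(32, 3, 224, 224, device="cuda").contiguous(memory_format=torch.channels_last)
target = torch.randint(0, 1000, (32,), device="cuda")

output = model(images)
loss = criterion(output, target)
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
# The traceback below is raised while apex unscales the gradients as the scale_loss context exits.
optimizer.step()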

To Reproduce

Steps to reproduce the behavior:

Use the apex ImageNet example (main_amp.py):

python -m torch.distributed.launch --nproc_per_node=2 main_amp.py -a=mnasnet1_3 --b 224 --workers 4 --channels-last=True --opt-level=O1 -b=256 /intel_nvme/imagenet_data/

Traceback (most recent call last):
  File "main_amp.py", line 542, in <module>
    main()
  File "main_amp.py", line 247, in main
    train(train_loader, model, criterion, optimizer, epoch)
  File "main_amp.py", line 353, in train
    scaled_loss.backward()
  File "/home/tstand/anaconda3/lib/python3.7/contextlib.py", line 119, in __exit__
    next(self.gen)
  File "/home/tstand/anaconda3/lib/python3.7/site-packages/apex/amp/handle.py", line 123, in scale_loss
    optimizer._post_amp_backward(loss_scaler)
  File "/home/tstand/anaconda3/lib/python3.7/site-packages/apex/amp/_process_optimizer.py", line 249, in post_backward_no_master_weights
    post_backward_models_are_masters(scaler, params, stashed_grads)
  File "/home/tstand/anaconda3/lib/python3.7/site-packages/apex/amp/_process_optimizer.py", line 135, in post_backward_models_are_masters
    scale_override=(grads_have_scale, stashed_have_scale, out_scale))
  File "/home/tstand/anaconda3/lib/python3.7/site-packages/apex/amp/scaler.py", line 184, in unscale_with_stashed
    out_scale/stashed_have_scale)
  File "/home/tstand/anaconda3/lib/python3.7/site-packages/apex/amp/scaler.py", line 148, in unscale_with_stashed_python
    self.dynamic)
  File "/home/tstand/anaconda3/lib/python3.7/site-packages/apex/amp/scaler.py", line 22, in axpby_check_overflow_python
    cpu_sum = float(model_grad.float().sum())
RuntimeError: CUDA error: an illegal memory access was encountered
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: an illegal memory access was encountered (insert_events at /pytorch/c10/cuda/CUDACachingAllocator.cpp:771)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x46 (0x7f5827507536 in /home/tstand/.local/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x7ae (0x7f582774afbe in /home/tstand/.local/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7f58274f7abd in /home/tstand/.local/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #3: + 0x5236b2 (0x7f58732c06b2 in /home/tstand/.local/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #4: + 0x523756 (0x7f58732c0756 in /home/tstand/.local/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #5: + 0x19dfce (0x55748c40bfce in /home/tstand/anaconda3/bin/python)
frame #6: + 0x103948 (0x55748c371948 in /home/tstand/anaconda3/bin/python)
frame #7: + 0x114267 (0x55748c382267 in /home/tstand/anaconda3/bin/python)
frame #8: + 0x11427d (0x55748c38227d in /home/tstand/anaconda3/bin/python)
frame #9: + 0x11427d (0x55748c38227d in /home/tstand/anaconda3/bin/python)
frame #10: PyDict_SetItem + 0x502 (0x55748c3cd602 in /home/tstand/anaconda3/bin/python)
frame #11: PyDict_SetItemString + 0x4f (0x55748c3ce0cf in /home/tstand/anaconda3/bin/python)
frame #12: PyImport_Cleanup + 0x9e (0x55748c40d91e in /home/tstand/anaconda3/bin/python)
frame #13: Py_FinalizeEx + 0x67 (0x55748c483367 in /home/tstand/anaconda3/bin/python)
frame #14: + 0x227d93 (0x55748c495d93 in /home/tstand/anaconda3/bin/python)
frame #15: _Py_UnixMain + 0x3c (0x55748c4960bc in /home/tstand/anaconda3/bin/python)
frame #16: __libc_start_main + 0xf3 (0x7f5875ba81e3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #17: + 0x1d0990 (0x55748c43e990 in /home/tstand/anaconda3/bin/python)

THCudaCheck FAIL file=/pytorch/aten/src/THC/THCCachingHostAllocator.cpp line=278 error=700 : an illegal memory access was encountered
Traceback (most recent call last):
  File "main_amp.py", line 542, in <module>
    main()
  File "main_amp.py", line 247, in main
    train(train_loader, model, criterion, optimizer, epoch)
  File "main_amp.py", line 353, in train
    scaled_loss.backward()
  File "/home/tstand/anaconda3/lib/python3.7/contextlib.py", line 119, in __exit__
    next(self.gen)
  File "/home/tstand/anaconda3/lib/python3.7/site-packages/apex/amp/handle.py", line 123, in scale_loss
    optimizer._post_amp_backward(loss_scaler)
  File "/home/tstand/anaconda3/lib/python3.7/site-packages/apex/amp/_process_optimizer.py", line 249, in post_backward_no_master_weights
    post_backward_models_are_masters(scaler, params, stashed_grads)
  File "/home/tstand/anaconda3/lib/python3.7/site-packages/apex/amp/_process_optimizer.py", line 135, in post_backward_models_are_masters
    scale_override=(grads_have_scale, stashed_have_scale, out_scale))
  File "/home/tstand/anaconda3/lib/python3.7/site-packages/apex/amp/scaler.py", line 184, in unscale_with_stashed
    out_scale/stashed_have_scale)
  File "/home/tstand/anaconda3/lib/python3.7/site-packages/apex/amp/scaler.py", line 148, in unscale_with_stashed_python
    self.dynamic)
  File "/home/tstand/anaconda3/lib/python3.7/site-packages/apex/amp/scaler.py", line 22, in axpby_check_overflow_python
    cpu_sum = float(model_grad.float().sum())
RuntimeError: CUDA error: an illegal memory access was encountered
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: an illegal memory access was encountered (insert_events at /pytorch/c10/cuda/CUDACachingAllocator.cpp:771)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x46 (0x7f911c250536 in /home/tstand/.local/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x7ae (0x7f911c493fbe in /home/tstand/.local/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7f911c240abd in /home/tstand/.local/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #3: + 0x5236b2 (0x7f91680096b2 in /home/tstand/.local/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #4: + 0x523756 (0x7f9168009756 in /home/tstand/.local/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #5: + 0x19dfce (0x5599d63f8fce in /home/tstand/anaconda3/bin/python)
frame #6: + 0x103948 (0x5599d635e948 in /home/tstand/anaconda3/bin/python)
frame #7: + 0x114267 (0x5599d636f267 in /home/tstand/anaconda3/bin/python)
frame #8: + 0x11427d (0x5599d636f27d in /home/tstand/anaconda3/bin/python)
frame #9: + 0x11427d (0x5599d636f27d in /home/tstand/anaconda3/bin/python)
frame #10: PyDict_SetItem + 0x502 (0x5599d63ba602 in /home/tstand/anaconda3/bin/python)
frame #11: PyDict_SetItemString + 0x4f (0x5599d63bb0cf in /home/tstand/anaconda3/bin/python)
frame #12: PyImport_Cleanup + 0x9e (0x5599d63fa91e in /home/tstand/anaconda3/bin/python)
frame #13: Py_FinalizeEx + 0x67 (0x5599d6470367 in /home/tstand/anaconda3/bin/python)
frame #14: + 0x227d93 (0x5599d6482d93 in /home/tstand/anaconda3/bin/python)
frame #15: _Py_UnixMain + 0x3c (0x5599d64830bc in /home/tstand/anaconda3/bin/python)
frame #16: __libc_start_main + 0xf3 (0x7f916a8f11e3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #17: + 0x1d0990 (0x5599d642b990 in /home/tstand/anaconda3/bin/python)

Environment

Collecting environment information...
PyTorch version: 1.5.0
Is debug build: No
CUDA used to build PyTorch: 10.2

OS: Ubuntu 19.10
GCC version: (Ubuntu 9.2.1-9ubuntu2) 9.2.1 20191008
CMake version: Could not collect

Python version: 3.7
Is CUDA available: Yes
CUDA runtime version: Could not collect (not installed outside pytorch)
GPU models and configuration:
GPU 0: TITAN RTX
GPU 1: TITAN RTX

Nvidia driver version: 440.82
cuDNN version: Could not collect (not installed outside pytorch)

Versions of relevant libraries:
[pip3] numpy==1.18.3
[pip3] torch==1.5.0
[pip3] torchvision==0.6.0
[conda] blas 1.0 mkl
[conda] cudatoolkit 10.2.89 hfd86e86_0
[conda] mkl 2020.0 166
[conda] mkl-service 2.3.0 py37he904b0f_0
[conda] mkl_fft 1.0.15 py37ha843d7b_0
[conda] mkl_random 1.1.0 py37hd6b4f25_0
[conda] numpy 1.18.1 py37h4f9e942_0
[conda] numpy-base 1.18.1 py37hde5b4d6_1
[conda] numpydoc 0.9.2 py_0 conda-forge
[conda] pytorch 1.5.0 py3.7_cuda10.2.89_cudnn7.6.5_0 pytorch
[conda] torchvision 0.6.0 py37_cu102 pytorch

cc @ezyang @gchanan @zou3519 @bdhirsh @heitorschueroff @seemethere @malfet @walterddr @ngimel @csarofeen @ptrblck

Labels: high priority, module: binaries, module: cuda, module: cudnn, module: dependency bug, triaged
