
Error in test suite: an illegal memory access was encountered #41340

Closed · Flamefire opened this issue Jul 13, 2020 · 4 comments
Labels
high priority · module: cublas (Problem related to cublas support) · module: cuda (Related to torch.cuda, and CUDA support in general) · triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Comments

Flamefire (Collaborator) commented Jul 13, 2020

🐛 Bug

Running the test suite fails on our system. The issue seems to be in TestTorchDeviceTypeCUDA: starting with test_blas_alpha_beta_empty_cuda_float16, all subsequent tests fail with RuntimeError: CUDA error: an illegal memory access was encountered.

To Reproduce

Steps to reproduce the behavior:

  1. python run_tests.py

One of the tracebacks is:

Traceback (most recent call last):
  File "/tmp/easybuild-tmp/eb-1Ebm0K/tmpcR9xV8/lib/python3.7/site-packages/torch/testing/_internal/common_utils.py", line 777, in wrapper
    method(*args, **kwargs)
  File "/tmp/easybuild-tmp/eb-1Ebm0K/tmpcR9xV8/lib/python3.7/site-packages/torch/testing/_internal/common_utils.py", line 777, in wrapper
    method(*args, **kwargs)
  File "/tmp/easybuild-tmp/eb-1Ebm0K/tmpcR9xV8/lib/python3.7/site-packages/torch/testing/_internal/common_device_type.py", line 241, in instantiated_test
    result = test(self, device_arg, dtype)
  File "/tmp/easybuild-tmp/eb-1Ebm0K/tmpcR9xV8/lib/python3.7/site-packages/torch/testing/_internal/common_device_type.py", line 411, in dep_fn
    return fn(slf, device, *args, **kwargs)
  File "test_torch.py", line 13909, in test_blas_alpha_beta_empty
    torch.addmv(input=input, mat=mat, vec=vec, alpha=alpha, beta=beta))
  File "/tmp/easybuild-tmp/eb-1Ebm0K/tmpcR9xV8/lib/python3.7/site-packages/torch/testing/_internal/common_utils.py", line 1080, in assertEqual
    exact_device=exact_device)
  File "/tmp/easybuild-tmp/eb-1Ebm0K/tmpcR9xV8/lib/python3.7/site-packages/torch/testing/_internal/common_utils.py", line 971, in _compareTensors
    return _compare_tensors_internal(a, b, rtol=rtol, atol=atol, equal_nan=equal_nan)
  File "/tmp/easybuild-tmp/eb-1Ebm0K/tmpcR9xV8/lib/python3.7/site-packages/torch/testing/__init__.py", line 122, in _compare_tensors_internal
    if torch.allclose(a, b, rtol=rtol, atol=atol, equal_nan=equal_nan):
RuntimeError: CUDA error: an illegal memory access was encountered

Maybe related to #21819 or #36722
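
For reference, a minimal sketch of the operation the failing test appears to exercise, based on the test name and the addmv call in the traceback. The shapes, alpha/beta values, and variable names below are assumptions, not copied from test_torch.py:

import torch

# Hedged reproduction sketch (assumed shapes/values): addmv in float16 on CUDA
# with empty mat/vec operands, so mat @ vec contributes nothing and the result
# should simply be beta * input.
device = "cuda"
dtype = torch.float16
alpha, beta = 1.2, 0.8

input = torch.randn(3, device=device, dtype=dtype)    # (n,) accumulator
mat = torch.randn(3, 0, device=device, dtype=dtype)   # (n, 0): empty inner dimension
vec = torch.randn(0, device=device, dtype=dtype)      # (0,)

out = torch.addmv(input=input, mat=mat, vec=vec, alpha=alpha, beta=beta)
print(out)  # in the report, comparing this result later raised the illegal memory access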

Environment

PyTorch version: 1.6.0-rc2
Is debug build: N/A
CUDA used to build PyTorch: N/A

OS: Red Hat Enterprise Linux Server release 7.8 (Maipo)
GCC version: (GCC) 8.3.0
CMake version: version 3.15.3

Python version: 3.7
Is CUDA available: N/A
CUDA runtime version: 10.1.243
GPU models and configuration:
GPU 0: Tesla K80
GPU 1: Tesla K80
GPU 2: Tesla K80
GPU 3: Tesla K80

Nvidia driver version: 450.36.06
cuDNN version: Could not collect

Versions of relevant libraries:
[pip3] numpy==1.17.3

cc @ezyang @gchanan @zou3519 @csarofeen @ptrblck @ngimel

@ailzhang added the "module: cuda" and "triaged" labels on Jul 13, 2020
@ngimel added the "module: cublas" and "high priority" labels on Jul 13, 2020
ngimel (Collaborator) commented Jul 13, 2020

Setting high priority because it's a crash.

albanD (Collaborator) commented Jul 13, 2020

High priority to at least reproduce and confirm that this is related to K80.

ptrblck (Collaborator) commented Jul 13, 2020

We are trying to set up a node with a K80 GPU to reproduce this issue (waiting on our cluster team to see whether that is possible today).
A faster workaround could be to set up a machine with a Titan Z, which might be able to reproduce it.

Flamefire (Collaborator, Author) commented

I can reproduce this when running a single test. Example:

  • pytest test_torch.py -v -k test_blas_alpha_beta_empty_cuda_float32 - SUCCESS
  • pytest test_torch.py -v -k test_blas_alpha_beta_empty_cuda_float16 - FAIL

So it seems to be related to the float16 type, although it doesn't affect all tests; e.g. test_atanh_inplace_cuda_float16 works.
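
A side note on interpreting the traceback above (general CUDA behavior, not something stated in the report): CUDA kernels run asynchronously, so the illegal access may originate in the earlier addmv kernel and only surface in a later call such as torch.allclose. A minimal sketch of how to pin the error closer to its source when rerunning the single failing test (shapes and values are assumptions):

import torch

# Hedged debugging sketch (not part of the original report): synchronize right
# after the suspect op so any pending asynchronous CUDA error surfaces here
# rather than in a later, unrelated call.
b = torch.randn(3, device="cuda", dtype=torch.float16)
m = torch.randn(3, 0, device="cuda", dtype=torch.float16)
v = torch.randn(0, device="cuda", dtype=torch.float16)

out = torch.addmv(input=b, mat=m, vec=v, alpha=1.0, beta=1.0)  # suspect op from the test
torch.cuda.synchronize()  # if the kernel faulted, the RuntimeError is raised here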
