
Error in test suite: an illegal memory access was encountered #41340

Closed · Flamefire opened this issue Jul 13, 2020 · 4 comments
Labels
high priority · module: cublas (Problem related to cublas support) · module: cuda (Related to torch.cuda, and CUDA support in general) · triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Comments

Flamefire (Collaborator) commented Jul 13, 2020

🐛 Bug

Running the test suite fails on our system. The issue seems to be in TestTorchDeviceTypeCUDA: starting with test_blas_alpha_beta_empty_cuda_float16, all subsequent tests fail with RuntimeError: CUDA error: an illegal memory access was encountered.

To Reproduce

Steps to reproduce the behavior:

  1. python run_tests.py

One of the tracebacks is:

Traceback (most recent call last):
  File "/tmp/easybuild-tmp/eb-1Ebm0K/tmpcR9xV8/lib/python3.7/site-packages/torch/testing/_internal/common_utils.py", line 777, in wrapper
    method(*args, **kwargs)
  File "/tmp/easybuild-tmp/eb-1Ebm0K/tmpcR9xV8/lib/python3.7/site-packages/torch/testing/_internal/common_utils.py", line 777, in wrapper
    method(*args, **kwargs)
  File "/tmp/easybuild-tmp/eb-1Ebm0K/tmpcR9xV8/lib/python3.7/site-packages/torch/testing/_internal/common_device_type.py", line 241, in instantiated_test
    result = test(self, device_arg, dtype)
  File "/tmp/easybuild-tmp/eb-1Ebm0K/tmpcR9xV8/lib/python3.7/site-packages/torch/testing/_internal/common_device_type.py", line 411, in dep_fn
    return fn(slf, device, *args, **kwargs)
  File "test_torch.py", line 13909, in test_blas_alpha_beta_empty
    torch.addmv(input=input, mat=mat, vec=vec, alpha=alpha, beta=beta))
  File "/tmp/easybuild-tmp/eb-1Ebm0K/tmpcR9xV8/lib/python3.7/site-packages/torch/testing/_internal/common_utils.py", line 1080, in assertEqual
    exact_device=exact_device)
  File "/tmp/easybuild-tmp/eb-1Ebm0K/tmpcR9xV8/lib/python3.7/site-packages/torch/testing/_internal/common_utils.py", line 971, in _compareTensors
    return _compare_tensors_internal(a, b, rtol=rtol, atol=atol, equal_nan=equal_nan)
  File "/tmp/easybuild-tmp/eb-1Ebm0K/tmpcR9xV8/lib/python3.7/site-packages/torch/testing/__init__.py", line 122, in _compare_tensors_internal
    if torch.allclose(a, b, rtol=rtol, atol=atol, equal_nan=equal_nan):
RuntimeError: CUDA error: an illegal memory access was encountered

Maybe related to #21819 or #36722
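
For reference, a minimal sketch of the operation the failing test appears to exercise, based on the test name and the addmv call in the traceback. The shapes, alpha/beta values, and variable names below are assumptions, not copied from test_torch.py:

import torch

# Hedged reproduction sketch (assumed shapes/values): addmv in float16 on CUDA
# with empty mat/vec operands, so mat @ vec contributes nothing and the result
# should simply be beta * input.
device = "cuda"
dtype = torch.float16
alpha, beta = 1.2, 0.8

input = torch.randn(3, device=device, dtype=dtype)    # (n,) accumulator
mat = torch.randn(3, 0, device=device, dtype=dtype)   # (n, 0): empty inner dimension
vec = torch.randn(0, device=device, dtype=dtype)      # (0,)

out = torch.addmv(input=input, mat=mat, vec=vec, alpha=alpha, beta=beta)
print(out)  # in the report, comparing this result later raised the illegal memory access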

Environment

PyTorch version: 1.6.0-rc2
Is debug build: N/A
CUDA used to build PyTorch: N/A

OS: Red Hat Enterprise Linux Server release 7.8 (Maipo)
GCC version: (GCC) 8.3.0
CMake version: version 3.15.3

Python version: 3.7
Is CUDA available: N/A
CUDA runtime version: 10.1.243
GPU models and configuration:
GPU 0: Tesla K80
GPU 1: Tesla K80
GPU 2: Tesla K80
GPU 3: Tesla K80

Nvidia driver version: 450.36.06
cuDNN version: Could not collect

Versions of relevant libraries:
[pip3] numpy==1.17.3

cc @ezyang @gchanan @zou3519 @csarofeen @ptrblck @ngimel

@ailzhang added the "module: cuda" and "triaged" labels on Jul 13, 2020
@ngimel added the "module: cublas" and "high priority" labels on Jul 13, 2020
ngimel (Collaborator) commented Jul 13, 2020

Setting high priority because it's a crash.

albanD (Collaborator) commented Jul 13, 2020

High priority to at least reproduce and confirm that this is related to K80.

ptrblck (Collaborator) commented Jul 13, 2020

We are trying to set up a node with a K80 GPU to reproduce this issue (waiting on our cluster team to see whether that is possible today).
A faster workaround could be to set up a machine with a Titan Z, which might be able to reproduce it.

Flamefire (Collaborator, Author) commented

I can reproduce this when running a single test. Example:

  • pytest test_torch.py -v -k test_blas_alpha_beta_empty_cuda_float32 - SUCCESS
  • pytest test_torch.py -v -k test_blas_alpha_beta_empty_cuda_float16 - FAIL

So it seems to be related to the float16 type, although it doesn't affect all tests; e.g. test_atanh_inplace_cuda_float16 works.
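
A side note on interpreting the traceback above (general CUDA behavior, not something stated in the report): CUDA kernels run asynchronously, so the illegal access may originate in the earlier addmv kernel and only surface in a later call such as torch.allclose. A minimal sketch of how to pin the error closer to its source when rerunning the single failing test (shapes and values are assumptions):

import torch

# Hedged debugging sketch (not part of the original report): synchronize right
# after the suspect op so any pending asynchronous CUDA error surfaces here
# rather than in a later, unrelated call.
b = torch.randn(3, device="cuda", dtype=torch.float16)
m = torch.randn(3, 0, device="cuda", dtype=torch.float16)
v = torch.randn(0, device="cuda", dtype=torch.float16)

out = torch.addmv(input=b, mat=m, vec=v, alpha=1.0, beta=1.0)  # suspect op from the test
torch.cuda.synchronize()  # if the kernel faulted, the RuntimeError is raised here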
