"Memory access fault by GPU node-1" error in Conv3d. #718

Closed · ghost opened this issue Nov 2, 2019 · 8 comments

ghost commented Nov 2, 2019

🐛 Bug

Got "Memory access fault by GPU node-1" when training my model, now I can reproduce the problem in a very simple script.
the env is ROCM 2.9.6, Radeon VII, I compiled pytorch from the most recent source on master branch.
details as follow.

To Reproduce

import torch
import torch.nn as nn

t = torch.rand(2, 32, 64, 128, 160).to('cuda')  # N=2, C=32, D=64, H=128, W=160
t2 = nn.Conv3d(32, 16, kernel_size=3, stride=1, padding=1, bias=False).to('cuda')(t)  # error occurs here
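
The same op can be checked on the CPU to confirm the shapes and the convolution itself are valid; this is a minimal sketch added for comparison (the CPU run is not part of the original report):

import torch
import torch.nn as nn

# Same shapes as the GPU repro, but on the CPU; with kernel_size=3, stride=1, padding=1
# the spatial dims are preserved, so the expected output is [2, 16, 64, 128, 160].
t_cpu = torch.rand(2, 32, 64, 128, 160)
conv = nn.Conv3d(32, 16, kernel_size=3, stride=1, padding=1, bias=False)
out = conv(t_cpu)
print(out.shape)  # torch.Size([2, 16, 64, 128, 160])

The full interactive session on the GPU, with HIP API tracing enabled, looked like this: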

Python 3.7.5 (default, Oct 25 2019, 15:51:11)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.

import torch
import torch.nn as nn
t=torch.rand(2,32,64,128,160).to('cuda')
HIP_DB=0x1 [api]
hip-api pid:9748 tid:1:HIP initialized short_tid#1 (maps to full_tid: 0x7fba8044f740)
t2=nn.Conv3d(32, 16, kernel_size=3, stride=1, padding=1, bias=False).to('cuda')(t)
<<hip-api pid:9748 tid:1.63 9748 1.63 hipLaunchKernel 'ZN12_GLOBAL__N_110hip_fill_nILj256EPjmjEEvT0_T1_T2' gridDim:{163840,1,1} groupDim:{256,1,1} sharedMem:+0 stream:0.0 @5334006293209
<<hip-api pid:9748 tid:1.69 9748 1.69 hipLaunchKernel 'ZN2at6native14vol2col_kernelIfEEviPKT_iiiiiiiiiiiiiiiiiiPS2' gridDim:{40960,1,1} groupDim:{1024,1,1} sharedMem:+0 stream:0.0 @5340563243577
<<hip-api pid:9748 tid:1.409 9748 1.409 hipLaunchKernel 'Cijk_Ailk_Bljk_SB_MT128x64x8_SE_APM1_AF0EM1_AF1EM1_AMAS3_ASEM1_BL1_DTL0_EPS1_FL1_GRVW4_GSU1_ISA906_IU1_K1_KLA_LPA0_LPB0_LDL1_MGWVW1_NLCA1_NLCB1_PK0_PGR1_PLR1_RK0_SU32_SNLL0_TT8_4_USFGRO0_VAW1_VW4_WG16_16_1_WGM8' gridDim:{10240,1,1} groupDim:{256,1,1} sharedMem:+0 stream:0.0 @5340572207622
Memory access fault by GPU node-1 (Agent handle: 0x55e2fa08a6f0) on address 0x7fb968e02000. Reason: Page not present or supervisor privilege.
Aborted (core dumped)

Environment

ROCM Version: 2.9.6

PyTorch version: 1.4.0a0+21ab112
Is debug build: No
CUDA used to build PyTorch: Could not collect

OS: Ubuntu 18.04.3 LTS
GCC version: (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0
CMake version: version 3.12.0

Python version: 3.7
Is CUDA available: Yes
CUDA runtime version: 10.0.130
GPU models and configuration: Could not collect
Nvidia driver version: Could not collect
cuDNN version: Could not collect

Versions of relevant libraries:
[pip] numpy==1.17.3
[pip] torch==1.4.0a0+21ab112
[pip] torchvision==0.2.0
[conda] mkl 2019.4 243
[conda] mkl-include 2019.4 243

@iotamudelta

This looks like an issue with MIOpen. Transferring over.

@iotamudelta iotamudelta transferred this issue from ROCm/pytorch Nov 4, 2019
@daniellowell daniellowell self-assigned this Nov 4, 2019
@iotamudelta

@daniellowell can reproduce the issue. Logging shows it is this rocBLAS call:

# MIOPEN_ENABLE_LOGGING=1 MIOPEN_LOG_LEVEL=7 MIOPEN_ENABLE_LOGGING_CMD=1 ROCBLAS_LAYER=2 python3.6 breakme.py 
./rocblas-bench -f gemm -r f32_r --transposeA N --transposeB N -m 1310720 -n 16 -k 864 --alpha 1 --lda 1310720 --ldb 864 --beta 1 --ldc 1310720
./rocblas-bench -f gemm -r f32_r --transposeA N --transposeB N -m 1310720 -n 16 -k 864 --alpha 1 --lda 1310720 --ldb 864 --beta 1 --ldc 1310720
Memory access fault by GPU node-2 (Agent handle: 0x4464cb0) on address 0x7f12f3701000. Reason: Page not present or supervisor privilege.
Aborted (core dumped)
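
Those GEMM dimensions line up with the vol2col lowering visible in the HIP trace above: m is the number of output voxels per sample (64 * 128 * 160 = 1,310,720), k is the unrolled filter volume (32 * 3 * 3 * 3 = 864), and n is the number of output channels (16). A quick sanity check of that arithmetic (the mapping to vol2col is inferred from the kernel names and shapes, not stated in the logs):

# Back-of-the-envelope check of the rocblas-bench GEMM shape (assumed vol2col mapping).
d_out, h_out, w_out = 64, 128, 160   # kernel_size=3, stride=1, padding=1 preserve spatial dims
m = d_out * h_out * w_out            # 1310720 output voxels per sample
k = 32 * 3 * 3 * 3                   # 864 = C_in * kD * kH * kW
n = 16                               # C_out
print(m, n, k)                       # 1310720 16 864, matching -m/-n/-k above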

@daniellowell daniellowell transferred this issue from ROCm/MIOpen Nov 4, 2019
@daniellowell
Contributor

@amcamd Can you test the two configs above? @singvision is seeing a segfault. It points to rocBLAS, but it could be the way MIOpen is configuring the parameters.

@dagamayank

/cc @bragadeesh

@amcamd amcamd transferred this issue from ROCm/rocBLAS Nov 5, 2019
ghost (Author) commented Nov 11, 2019

Any progress on this issue? @daniellowell @amcamd

ghost (Author) commented Nov 28, 2019

The problem still exists on ROCm 2.10.

@sugar-mouse

Is anyone following up on this? I encountered this error, too. Is it a bug?

dodatko commented Jun 14, 2020

Yes, I sent my Radeon VII back to the seller and switched to an RTX 2070 because of this problem.
