"Memory access fault by GPU node-1" error in Conv3d. #718

Closed · ghost opened this issue Nov 2, 2019 · 8 comments

ghost commented Nov 2, 2019

🐛 Bug

Got "Memory access fault by GPU node-1" when training my model, now I can reproduce the problem in a very simple script.
the env is ROCM 2.9.6, Radeon VII, I compiled pytorch from the most recent source on master branch.
details as follow.

To Reproduce

import torch
import torch.nn as nn

t = torch.rand(2, 32, 64, 128, 160).to('cuda')  # N=2, C=32, D=64, H=128, W=160
t2 = nn.Conv3d(32, 16, kernel_size=3, stride=1, padding=1, bias=False).to('cuda')(t)  # error occurs here
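
The same op can be checked on the CPU to confirm the shapes and the convolution itself are valid; this is a minimal sketch added for comparison (the CPU run is not part of the original report):

import torch
import torch.nn as nn

# Same shapes as the GPU repro, but on the CPU; with kernel_size=3, stride=1, padding=1
# the spatial dims are preserved, so the expected output is [2, 16, 64, 128, 160].
t_cpu = torch.rand(2, 32, 64, 128, 160)
conv = nn.Conv3d(32, 16, kernel_size=3, stride=1, padding=1, bias=False)
out = conv(t_cpu)
print(out.shape)  # torch.Size([2, 16, 64, 128, 160])

The full interactive session on the GPU, with HIP API tracing enabled, looked like this: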

Python 3.7.5 (default, Oct 25 2019, 15:51:11)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.

import torch
import torch.nn as nn
t=torch.rand(2,32,64,128,160).to('cuda')
HIP_DB=0x1 [api]
hip-api pid:9748 tid:1:HIP initialized short_tid#1 (maps to full_tid: 0x7fba8044f740)
t2=nn.Conv3d(32, 16, kernel_size=3, stride=1, padding=1, bias=False).to('cuda')(t)
<<hip-api pid:9748 tid:1.63 9748 1.63 hipLaunchKernel 'ZN12_GLOBAL__N_110hip_fill_nILj256EPjmjEEvT0_T1_T2' gridDim:{163840,1,1} groupDim:{256,1,1} sharedMem:+0 stream:0.0 @5334006293209
<<hip-api pid:9748 tid:1.69 9748 1.69 hipLaunchKernel 'ZN2at6native14vol2col_kernelIfEEviPKT_iiiiiiiiiiiiiiiiiiPS2' gridDim:{40960,1,1} groupDim:{1024,1,1} sharedMem:+0 stream:0.0 @5340563243577
<<hip-api pid:9748 tid:1.409 9748 1.409 hipLaunchKernel 'Cijk_Ailk_Bljk_SB_MT128x64x8_SE_APM1_AF0EM1_AF1EM1_AMAS3_ASEM1_BL1_DTL0_EPS1_FL1_GRVW4_GSU1_ISA906_IU1_K1_KLA_LPA0_LPB0_LDL1_MGWVW1_NLCA1_NLCB1_PK0_PGR1_PLR1_RK0_SU32_SNLL0_TT8_4_USFGRO0_VAW1_VW4_WG16_16_1_WGM8' gridDim:{10240,1,1} groupDim:{256,1,1} sharedMem:+0 stream:0.0 @5340572207622
Memory access fault by GPU node-1 (Agent handle: 0x55e2fa08a6f0) on address 0x7fb968e02000. Reason: Page not present or supervisor privilege.
Aborted (core dumped)

Environment

ROCM Version: 2.9.6

PyTorch version: 1.4.0a0+21ab112
Is debug build: No
CUDA used to build PyTorch: Could not collect

OS: Ubuntu 18.04.3 LTS
GCC version: (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0
CMake version: version 3.12.0

Python version: 3.7
Is CUDA available: Yes
CUDA runtime version: 10.0.130
GPU models and configuration: Could not collect
Nvidia driver version: Could not collect
cuDNN version: Could not collect

Versions of relevant libraries:
[pip] numpy==1.17.3
[pip] torch==1.4.0a0+21ab112
[pip] torchvision==0.2.0
[conda] mkl 2019.4 243
[conda] mkl-include 2019.4 243

@iotamudelta

This looks like an issue with MIOpen. Transferring over.

@iotamudelta iotamudelta transferred this issue from ROCm/pytorch Nov 4, 2019
@daniellowell daniellowell self-assigned this Nov 4, 2019
@iotamudelta

@daniellowell can reproduce the issue. Logging shows it is this rocBLAS call:

# MIOPEN_ENABLE_LOGGING=1 MIOPEN_LOG_LEVEL=7 MIOPEN_ENABLE_LOGGING_CMD=1 ROCBLAS_LAYER=2 python3.6 breakme.py 
./rocblas-bench -f gemm -r f32_r --transposeA N --transposeB N -m 1310720 -n 16 -k 864 --alpha 1 --lda 1310720 --ldb 864 --beta 1 --ldc 1310720
./rocblas-bench -f gemm -r f32_r --transposeA N --transposeB N -m 1310720 -n 16 -k 864 --alpha 1 --lda 1310720 --ldb 864 --beta 1 --ldc 1310720
Memory access fault by GPU node-2 (Agent handle: 0x4464cb0) on address 0x7f12f3701000. Reason: Page not present or supervisor privilege.
Aborted (core dumped)
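
Those GEMM dimensions line up with the vol2col lowering visible in the HIP trace above: m is the number of output voxels per sample (64 * 128 * 160 = 1,310,720), k is the unrolled filter volume (32 * 3 * 3 * 3 = 864), and n is the number of output channels (16). A quick sanity check of that arithmetic (the mapping to vol2col is inferred from the kernel names and shapes, not stated in the logs):

# Back-of-the-envelope check of the rocblas-bench GEMM shape (assumed vol2col mapping).
d_out, h_out, w_out = 64, 128, 160   # kernel_size=3, stride=1, padding=1 preserve spatial dims
m = d_out * h_out * w_out            # 1310720 output voxels per sample
k = 32 * 3 * 3 * 3                   # 864 = C_in * kD * kH * kW
n = 16                               # C_out
print(m, n, k)                       # 1310720 16 864, matching -m/-n/-k above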

@daniellowell daniellowell transferred this issue from ROCm/MIOpen Nov 4, 2019
@daniellowell
Contributor

@amcamd Can you test the two configs above? @singvision is seeing a segfault. It points to rocBLAS, but it could be the way MIOpen is configuring the parameters.

@dagamayank

/cc @bragadeesh

@amcamd amcamd transferred this issue from ROCm/rocBLAS Nov 5, 2019
ghost (Author) commented Nov 11, 2019

Any progress on this issue? @daniellowell @amcamd

ghost (Author) commented Nov 28, 2019

The problem still exists on ROCm 2.10.

@sugar-mouse

Is anyone following up on this? I encountered this error, too. Is it a bug?

dodatko commented Jun 14, 2020

Yes, I sent my Radeon VII back to the seller and switched to an RTX 2070 because of this problem.
