[HIP] memory errors affecting Microphysics codes on AMD GPU #447

Open
BenWibking opened this issue Nov 12, 2023 · 4 comments
Labels
AMDGPU affects AMD GPUs bug Something isn't working Setonix

Comments

@BenWibking
Collaborator

Describe the bug
AMD's AddressSanitizer reports a memory corruption bug in amrex::InitRandom(). It's currently unclear if this is a real bug or a false positive.

To Reproduce
Steps to reproduce the behavior:

  1. Compile the HydroBlast3D problem with the moth-sanitizer.profile settings on Moth (add build profile for AMDGPU ASAN #446).
  2. Run the problem.
  3. See error from ASAN:
bwibking@moth:~/quokka/build> ./src/HydroBlast3D/test_hydro3d_blast ../tests/benchmark_unigrid_256.in
Initializing AMReX (23.10-23-g601cc4ee80e0)...
MPI initialized with 1 MPI processes
MPI initialized with thread support level 0
Initializing HIP...
HIP initialized with 1 device.
=================================================================
==1379366==ERROR: AddressSanitizer: global-buffer-overflow on address 0x00000326cfe8 at pc 0x7fc7796a5ea7 bp 0x7ffd90522b00 sp 0x7f
fd905222c0
READ of size 32 at 0x00000326cfe8 thread T0
    #0 0x7fc7796a5ea6 in __interceptor_memcpy (/opt/rocm-5.7.0/llvm/lib/clang/17.0.0/lib/linux/libclang_rt.asan-x86_64.so+0xa5ea6)
(BuildId: e2f6676d7d0ade0de2c4ac32fa5856892b18b70a)
    #1 0x7fc7781440a9  (/opt/rocm-5.7.0/lib/libamdhip64.so.5+0x3440a9) (BuildId: 7342fbe1c361ada40d7aa3c1da36c32f3fbe143d)
    #2 0x7fc7781462f6  (/opt/rocm-5.7.0/lib/libamdhip64.so.5+0x3462f6) (BuildId: 7342fbe1c361ada40d7aa3c1da36c32f3fbe143d)
    #3 0x7fc7781465a6  (/opt/rocm-5.7.0/lib/libamdhip64.so.5+0x3465a6) (BuildId: 7342fbe1c361ada40d7aa3c1da36c32f3fbe143d)
    #4 0x7fc778112434  (/opt/rocm-5.7.0/lib/libamdhip64.so.5+0x312434) (BuildId: 7342fbe1c361ada40d7aa3c1da36c32f3fbe143d)
    #5 0x7fc7780dcc53  (/opt/rocm-5.7.0/lib/libamdhip64.so.5+0x2dcc53) (BuildId: 7342fbe1c361ada40d7aa3c1da36c32f3fbe143d)
    #6 0x7fc777f835e9  (/opt/rocm-5.7.0/lib/libamdhip64.so.5+0x1835e9) (BuildId: 7342fbe1c361ada40d7aa3c1da36c32f3fbe143d)
    #7 0x7fc777e89c0e  (/opt/rocm-5.7.0/lib/libamdhip64.so.5+0x89c0e) (BuildId: 7342fbe1c361ada40d7aa3c1da36c32f3fbe143d)
    #8 0x7fc777fe650e  (/opt/rocm-5.7.0/lib/libamdhip64.so.5+0x1e650e) (BuildId: 7342fbe1c361ada40d7aa3c1da36c32f3fbe143d)
    #9 0x7fc778010bd9  (/opt/rocm-5.7.0/lib/libamdhip64.so.5+0x210bd9) (BuildId: 7342fbe1c361ada40d7aa3c1da36c32f3fbe143d)
    #10 0x7fc777fe6f91  (/opt/rocm-5.7.0/lib/libamdhip64.so.5+0x1e6f91) (BuildId: 7342fbe1c361ada40d7aa3c1da36c32f3fbe143d)
    #11 0x7fc777ff13e7 in hipLaunchKernel (/opt/rocm-5.7.0/lib/libamdhip64.so.5+0x1f13e7) (BuildId: 7342fbe1c361ada40d7aa3c1da36c32
f3fbe143d)
    #12 0x2ae25ef in std::enable_if<MaybeDeviceRunnable<(anonymous namespace)::ResizeRandomSeed(unsigned long)::'lambda'(int)>::val
ue, void>::type amrex::ParallelFor<256, int, (anonymous namespace)::ResizeRandomSeed(unsigned long)::'lambda'(int), void>(amrex::Gp
u::KernelInfo const&, int, (anonymous namespace)::ResizeRandomSeed(unsigned long)::'lambda'(int)&&) /home/bwibking/quokka/extern/am
rex/Src/Base/AMReX_GpuLaunchFunctsG.H:878:5
    #13 0x2ae25ef in void amrex::ParallelFor<int, (anonymous namespace)::ResizeRandomSeed(unsigned long)::'lambda'(int), void>(int,
 (anonymous namespace)::ResizeRandomSeed(unsigned long)::'lambda'(int)&&) /home/bwibking/quokka/extern/amrex/Src/Base/AMReX_GpuLaun
chFunctsG.H:1457:5
    #14 0x2ae25ef in (anonymous namespace)::ResizeRandomSeed(unsigned long) /home/bwibking/quokka/extern/amrex/Src/Base/AMReX_Rando
m.cpp:54:5
    #15 0x2ae25ef in amrex::InitRandom(unsigned long, int, unsigned long) /home/bwibking/quokka/extern/amrex/Src/Base/AMReX_Random.
cpp:104:5
    #16 0x2a21bf9 in amrex::Initialize(int&, char**&, bool, ompi_communicator_t*, std::function<void ()> const&, std::ostream&, std
::ostream&, void (*)(char const*)) /home/bwibking/quokka/extern/amrex/Src/Base/AMReX.cpp:618:5
    #17 0x29fc0cc in main /home/bwibking/quokka/src/main.cpp:22:2
    #18 0x7fc773c3feaf in __libc_start_call_main (/lib64/libc.so.6+0x3feaf) (BuildId: b39d468aead6d9ede227751ffe093da287488648)
    #19 0x7fc773c3ff5f in __libc_start_main@GLIBC_2.2.5 (/lib64/libc.so.6+0x3ff5f) (BuildId: b39d468aead6d9ede227751ffe093da2874886
48)
    #20 0x2897644 in _start (/home/bwibking/quokka/build/src/HydroBlast3D/test_hydro3d_blast+0x2897644)

0x00000326cfe8 is located 56 bytes before global variable 'EOSData::mindens' defined in '/home/bwibking/quokka/extern/Microphysics/
interfaces/eos_data.cpp' (0x326d020) of size 8
0x00000326cfe8 is located 24 bytes before global variable 'EOSData::maxtemp' defined in '/home/bwibking/quokka/extern/Microphysics/
interfaces/eos_data.cpp' (0x326d000) of size 8
0x00000326cfe8 is located 0 bytes after global variable 'EOSData::mintemp' defined in '/home/bwibking/quokka/extern/Microphysics/in
terfaces/eos_data.cpp' (0x326cfe0) of size 8
SUMMARY: AddressSanitizer: global-buffer-overflow (/opt/rocm-5.7.0/llvm/lib/clang/17.0.0/lib/linux/libclang_rt.asan-x86_64.so+0xa5e
a6) (BuildId: e2f6676d7d0ade0de2c4ac32fa5856892b18b70a) in __interceptor_memcpy
Shadow bytes around the buggy address:
  0x00000326cd00: 00 00 00 00 f9 f9 f9 f9 01 f9 f9 f9 00 00 00 00
  0x00000326cd80: f9 f9 f9 f9 01 f9 f9 f9 01 f9 f9 f9 00 f9 f9 f9
  0x00000326ce00: 00 f9 f9 f9 00 f9 f9 f9 00 f9 f9 f9 00 f9 f9 f9
  0x00000326ce80: 00 f9 f9 f9 00 f9 f9 f9 00 f9 f9 f9 00 f9 f9 f9
  0x00000326cf00: 00 f9 f9 f9 00 f9 f9 f9 00 f9 f9 f9 00 f9 f9 f9
=>0x00000326cf80: 00 f9 f9 f9 00 f9 f9 f9 00 f9 f9 f9 00[f9]f9 f9
  0x00000326d000: 00 f9 f9 f9 00 f9 f9 f9 00 f9 f9 f9 00 f9 f9 f9
  0x00000326d080: 00 f9 f9 f9 00 f9 f9 f9 00 f9 f9 f9 00 f9 f9 f9
  0x00000326d100: 00 f9 f9 f9 00 f9 f9 f9 00 f9 f9 f9 00 f9 f9 f9
  0x00000326d180: 00 f9 f9 f9 00 f9 f9 f9 00 f9 f9 f9 00 00 00 00
  0x00000326d200: 00 00 00 00 01 f9 f9 f9 00 f9 f9 f9 04 f9 f9 f9
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
==1379366==ABORTING

Additional context
Although the test suite passes on Moth, we have seen bizarre and incorrect behavior from Quokka when running production simulations on AMD GPUs. We want to determine whether this error is the cause or rule it out.

@BenWibking BenWibking added bug Something isn't working Setonix AMDGPU affects AMD GPUs labels Nov 12, 2023
@BenWibking
Collaborator Author

AMReX bug report: AMReX-Codes/amrex#3623

@BenWibking BenWibking changed the title [HIP] memory corruption inside AMReX reported by ASAN on AMD GPU [HIP] memory corruption reported by ASAN on AMD GPU Nov 19, 2023
@BenWibking
Collaborator Author

According to Weiqun, this is a false positive. However, the same error appears at several different places in the Microphysics unit tests, so it is likely that something is actually going wrong and ASAN is misdiagnosing it.

@BenWibking BenWibking changed the title [HIP] memory corruption reported by ASAN on AMD GPU [HIP] memory errors affecting Microphysics codes on AMD GPU Nov 19, 2023
@BenWibking
Collaborator Author

This is fixed by compiling with -mllvm -amdgpu-function-calls=true.

This works around AMDGPU compiler bugs when building very large kernels (the primordial chemistry network, the larger nuclear networks).

We should have CMake add this flag automatically when compiling for AMDGPU.
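
A minimal sketch of what that could look like, not a tested change. It assumes the GPU backend is selected through the AMReX_GPU_BACKEND cache variable and that device code is compiled as CXX; if the build uses CMake's HIP language instead, the language name in the generator expression would need to change accordingly.

```cmake
# Sketch only: assumes AMReX_GPU_BACKEND selects the GPU backend and that
# device code is compiled as CXX by a ROCm clang-based compiler.
if(AMReX_GPU_BACKEND MATCHES "HIP")
  # The SHELL: prefix keeps "-mllvm <arg>" together as a single option group,
  # so CMake does not de-duplicate the -mllvm prefix if other -mllvm options
  # are added elsewhere in the project.
  add_compile_options(
    "$<$<COMPILE_LANGUAGE:CXX>:SHELL:-mllvm -amdgpu-function-calls=true>")
endif()
```

Placing this before the targets are defined applies the flag to every C++ translation unit in the directory and below. Restricting it to the targets that build the large networks with target_compile_options would be a narrower alternative.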
