[HIP] memory errors affecting Microphysics codes on AMD GPU #447

BenWibking · 2023-11-12T18:42:20Z

Describe the bug
AMD's AddressSanitizer reports a memory corruption bug in amrex::InitRandom(). It's currently unclear if this is a real bug or a false positive.

To Reproduce
Steps to reproduce the behavior:

1. Compile the HydroBlast3D problem with the moth-sanitizer.profile settings on Moth (see #446, "add build profile for AMDGPU ASAN").
2. Run the problem.
3. See error from ASAN:

bwibking@moth:~/quokka/build> ./src/HydroBlast3D/test_hydro3d_blast ../tests/benchmark_unigrid_256.in
Initializing AMReX (23.10-23-g601cc4ee80e0)...
MPI initialized with 1 MPI processes
MPI initialized with thread support level 0
Initializing HIP...
HIP initialized with 1 device.
=================================================================
==1379366==ERROR: AddressSanitizer: global-buffer-overflow on address 0x00000326cfe8 at pc 0x7fc7796a5ea7 bp 0x7ffd90522b00 sp 0x7ffd905222c0
READ of size 32 at 0x00000326cfe8 thread T0
    #0 0x7fc7796a5ea6 in __interceptor_memcpy (/opt/rocm-5.7.0/llvm/lib/clang/17.0.0/lib/linux/libclang_rt.asan-x86_64.so+0xa5ea6) (BuildId: e2f6676d7d0ade0de2c4ac32fa5856892b18b70a)
    #1 0x7fc7781440a9  (/opt/rocm-5.7.0/lib/libamdhip64.so.5+0x3440a9) (BuildId: 7342fbe1c361ada40d7aa3c1da36c32f3fbe143d)
    #2 0x7fc7781462f6  (/opt/rocm-5.7.0/lib/libamdhip64.so.5+0x3462f6) (BuildId: 7342fbe1c361ada40d7aa3c1da36c32f3fbe143d)
    #3 0x7fc7781465a6  (/opt/rocm-5.7.0/lib/libamdhip64.so.5+0x3465a6) (BuildId: 7342fbe1c361ada40d7aa3c1da36c32f3fbe143d)
    #4 0x7fc778112434  (/opt/rocm-5.7.0/lib/libamdhip64.so.5+0x312434) (BuildId: 7342fbe1c361ada40d7aa3c1da36c32f3fbe143d)
    #5 0x7fc7780dcc53  (/opt/rocm-5.7.0/lib/libamdhip64.so.5+0x2dcc53) (BuildId: 7342fbe1c361ada40d7aa3c1da36c32f3fbe143d)
    #6 0x7fc777f835e9  (/opt/rocm-5.7.0/lib/libamdhip64.so.5+0x1835e9) (BuildId: 7342fbe1c361ada40d7aa3c1da36c32f3fbe143d)
    #7 0x7fc777e89c0e  (/opt/rocm-5.7.0/lib/libamdhip64.so.5+0x89c0e) (BuildId: 7342fbe1c361ada40d7aa3c1da36c32f3fbe143d)
    #8 0x7fc777fe650e  (/opt/rocm-5.7.0/lib/libamdhip64.so.5+0x1e650e) (BuildId: 7342fbe1c361ada40d7aa3c1da36c32f3fbe143d)
    #9 0x7fc778010bd9  (/opt/rocm-5.7.0/lib/libamdhip64.so.5+0x210bd9) (BuildId: 7342fbe1c361ada40d7aa3c1da36c32f3fbe143d)
    #10 0x7fc777fe6f91  (/opt/rocm-5.7.0/lib/libamdhip64.so.5+0x1e6f91) (BuildId: 7342fbe1c361ada40d7aa3c1da36c32f3fbe143d)
    #11 0x7fc777ff13e7 in hipLaunchKernel (/opt/rocm-5.7.0/lib/libamdhip64.so.5+0x1f13e7) (BuildId: 7342fbe1c361ada40d7aa3c1da36c32f3fbe143d)
    #12 0x2ae25ef in std::enable_if<MaybeDeviceRunnable<(anonymous namespace)::ResizeRandomSeed(unsigned long)::'lambda'(int)>::value, void>::type amrex::ParallelFor<256, int, (anonymous namespace)::ResizeRandomSeed(unsigned long)::'lambda'(int), void>(amrex::Gpu::KernelInfo const&, int, (anonymous namespace)::ResizeRandomSeed(unsigned long)::'lambda'(int)&&) /home/bwibking/quokka/extern/amrex/Src/Base/AMReX_GpuLaunchFunctsG.H:878:5
    #13 0x2ae25ef in void amrex::ParallelFor<int, (anonymous namespace)::ResizeRandomSeed(unsigned long)::'lambda'(int), void>(int, (anonymous namespace)::ResizeRandomSeed(unsigned long)::'lambda'(int)&&) /home/bwibking/quokka/extern/amrex/Src/Base/AMReX_GpuLaunchFunctsG.H:1457:5
    #14 0x2ae25ef in (anonymous namespace)::ResizeRandomSeed(unsigned long) /home/bwibking/quokka/extern/amrex/Src/Base/AMReX_Random.cpp:54:5
    #15 0x2ae25ef in amrex::InitRandom(unsigned long, int, unsigned long) /home/bwibking/quokka/extern/amrex/Src/Base/AMReX_Random.cpp:104:5
    #16 0x2a21bf9 in amrex::Initialize(int&, char**&, bool, ompi_communicator_t*, std::function<void ()> const&, std::ostream&, std::ostream&, void (*)(char const*)) /home/bwibking/quokka/extern/amrex/Src/Base/AMReX.cpp:618:5
    #17 0x29fc0cc in main /home/bwibking/quokka/src/main.cpp:22:2
    #18 0x7fc773c3feaf in __libc_start_call_main (/lib64/libc.so.6+0x3feaf) (BuildId: b39d468aead6d9ede227751ffe093da287488648)
    #19 0x7fc773c3ff5f in __libc_start_main@GLIBC_2.2.5 (/lib64/libc.so.6+0x3ff5f) (BuildId: b39d468aead6d9ede227751ffe093da287488648)
    #20 0x2897644 in _start (/home/bwibking/quokka/build/src/HydroBlast3D/test_hydro3d_blast+0x2897644)

0x00000326cfe8 is located 56 bytes before global variable 'EOSData::mindens' defined in '/home/bwibking/quokka/extern/Microphysics/interfaces/eos_data.cpp' (0x326d020) of size 8
0x00000326cfe8 is located 24 bytes before global variable 'EOSData::maxtemp' defined in '/home/bwibking/quokka/extern/Microphysics/interfaces/eos_data.cpp' (0x326d000) of size 8
0x00000326cfe8 is located 0 bytes after global variable 'EOSData::mintemp' defined in '/home/bwibking/quokka/extern/Microphysics/interfaces/eos_data.cpp' (0x326cfe0) of size 8
SUMMARY: AddressSanitizer: global-buffer-overflow (/opt/rocm-5.7.0/llvm/lib/clang/17.0.0/lib/linux/libclang_rt.asan-x86_64.so+0xa5ea6) (BuildId: e2f6676d7d0ade0de2c4ac32fa5856892b18b70a) in __interceptor_memcpy
Shadow bytes around the buggy address:
  0x00000326cd00: 00 00 00 00 f9 f9 f9 f9 01 f9 f9 f9 00 00 00 00
  0x00000326cd80: f9 f9 f9 f9 01 f9 f9 f9 01 f9 f9 f9 00 f9 f9 f9
  0x00000326ce00: 00 f9 f9 f9 00 f9 f9 f9 00 f9 f9 f9 00 f9 f9 f9
  0x00000326ce80: 00 f9 f9 f9 00 f9 f9 f9 00 f9 f9 f9 00 f9 f9 f9
  0x00000326cf00: 00 f9 f9 f9 00 f9 f9 f9 00 f9 f9 f9 00 f9 f9 f9
=>0x00000326cf80: 00 f9 f9 f9 00 f9 f9 f9 00 f9 f9 f9 00[f9]f9 f9
  0x00000326d000: 00 f9 f9 f9 00 f9 f9 f9 00 f9 f9 f9 00 f9 f9 f9
  0x00000326d080: 00 f9 f9 f9 00 f9 f9 f9 00 f9 f9 f9 00 f9 f9 f9
  0x00000326d100: 00 f9 f9 f9 00 f9 f9 f9 00 f9 f9 f9 00 f9 f9 f9
  0x00000326d180: 00 f9 f9 f9 00 f9 f9 f9 00 f9 f9 f9 00 00 00 00
  0x00000326d200: 00 00 00 00 01 f9 f9 f9 00 f9 f9 f9 04 f9 f9 f9
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
==1379366==ABORTING

Additional context
Although the test suite passes on Moth, we have seen bizarre and incorrect behavior from Quokka when running production simulations on AMD GPUs. We want to determine whether this memory error is the cause.

BenWibking · 2023-11-12T18:52:23Z

The buffer overflow is reported from inside this kernel:
https://github.com/AMReX-Codes/amrex/blob/d36463103daed09a40cdea235041a6ab79ff280c/Src/Base/AMReX_Random.cpp#L54

BenWibking · 2023-11-13T13:32:45Z

AMReX bug report: AMReX-Codes/amrex#3623

BenWibking · 2023-11-19T14:14:00Z

According to Weiqun, this is a false positive. However, the same error appears at different places in the Microphysics unit tests, so something is probably going wrong and ASAN is misdiagnosing it.

BenWibking · 2024-02-22T15:53:21Z

This is fixed by compiling with -mllvm -amdgpu-function-calls=true.

This works around AMDGPU compiler bugs when building very large kernels (the primordial chemistry network, the larger nuclear networks).

We should have CMake add this flag automatically when compiling for AMDGPU.
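
One way the CMake change could look is sketched below. This is only an illustration, not the change that was actually made: it assumes the HIP build is selected through the `AMReX_GPU_BACKEND` cache variable (as in AMReX's own CMake build), and it applies the flag to every target and language in the current directory scope, which may be broader than wanted.

```cmake
# Sketch only: assumes the GPU backend is chosen via AMReX_GPU_BACKEND.
# Adjust the condition if the project selects HIP some other way.
if(AMReX_GPU_BACKEND STREQUAL "HIP")
  # The "SHELL:" prefix keeps "-mllvm <arg>" together as one group, so that
  # CMake's option de-duplication cannot drop a repeated "-mllvm".
  add_compile_options("SHELL:-mllvm -amdgpu-function-calls=true")
endif()
```

Using `target_compile_options` on the specific targets that build the large networks would be a narrower alternative if the flag turns out to affect performance elsewhere.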

BenWibking added bug Something isn't working Setonix AMDGPU affects AMD GPUs labels Nov 12, 2023

BenWibking mentioned this issue Nov 14, 2023

ROCm memory issues affecting Microphysics codes AMReX-Astro/Microphysics#1386

Closed

BenWibking changed the title ~~[HIP] memory corruption inside AMReX reported by ASAN on AMD GPU~~ [HIP] memory corruption reported by ASAN on AMD GPU Nov 19, 2023

BenWibking changed the title ~~[HIP] memory corruption reported by ASAN on AMD GPU~~ [HIP] memory errors affecting Microphysics codes on AMD GPU Nov 19, 2023

psharda mentioned this issue Feb 16, 2024

Increase max timesteps to 300 for the PopIII Test #539

Closed
