amrex::Abort does not unconditionally abort on GPU #543

BenWibking · 2024-02-23T17:36:38Z

Describe the bug
We have been relying on amrex::Abort to unconditionally abort the code when called from GPU code. However, this only works when compiled in debug mode (or with additional compiler flags).

We need to rewrite the error handling code for iterative solves so that it correctly handles errors when -DCMAKE_BUILD_TYPE=Release is used (the default).

The Quokka documentation incorrectly describes amrex::Abort based on the old AMReX documentation (which was incorrect): https://quokka-astro.github.io/quokka/error_checking.html

To Reproduce
Steps to reproduce the behavior:

1. Compile the PopIII problem for GPU.
2. It should crash with a VODE failure at timestep 58.
3. Observe that it does not.

Additional context
The AMReX documentation was recently changed (October 2023) to reflect this new behavior, but this change was not communicated to the user community: AMReX-Codes/amrex#3605.

This affects both the Newton-Raphson solve used by the radiation code and the Microphysics chemistry integrator.
cc @psharda @chongchonghe


chongchonghe · 2024-03-03T23:43:20Z

This is important to know. I always rely on AMREX_ASSERT, AMREX_ALWAYS_ASSERT, or AMREX_ALWAYS_ASSERT_WITH_MESSAGE. I guess they all rely on amrex::Abort, right? Then what is the solution? How do we "rewrite the error handling code" so that the code handles errors in a release build? According to the documentation, we should set USE_ASSERTION=TRUE at compile time to enable AMREX_ALWAYS_ASSERT in a GPU release build.

BenWibking · 2024-03-04T02:48:03Z

> This is important to know. I always rely on AMREX_ASSERT, AMREX_ALWAYS_ASSERT, or AMREX_ALWAYS_ASSERT_WITH_MESSAGE. I guess they all rely on amrex::Abort, right?

Yes.

> Then, what is the solution? How do we "rewrite the error handling code" so that the code handles errors in release build?

You have to rewrite the function so it returns an error code that you then check. For an example, see the cooling code:

quokka/src/CloudyCooling.hpp, line 278 in fcde1aa: `nsubsteps(i, j, k) = nsteps;`

quokka/src/CloudyCooling.hpp, line 298 in fcde1aa: `int nmax = nsubstepsMF.max(0);`

You can then check the error condition in host code, and call amrex::Abort there. Or, you could return the failure condition to the calling function, and have it re-do the update with a smaller timestep. This is done in the hydro update.
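The pattern described above can be sketched in plain C++. This is only an illustration under stated assumptions, not the Quokka or AMReX implementation: `std::vector` stands in for a MultiFab, an ordinary loop stands in for the GPU kernel launch, and the names `SolveStatus`, `newton_sqrt`, `solve_all_cells`, `any_cell_failed`, and `update_or_throw` are all hypothetical. The per-cell solve never aborts; it records a status code, and the host decides what to do afterwards.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical per-cell status codes (not part of AMReX or Quokka).
enum SolveStatus : int { kSuccess = 0, kFailed = 1 };

// Stand-in for a device-callable Newton-Raphson solve of x*x = a.
// On failure it returns an error code instead of calling an abort function.
inline SolveStatus newton_sqrt(double a, int max_iter, double &x)
{
    x = (a > 1.0) ? a : 1.0; // initial guess
    for (int iter = 0; iter < max_iter; ++iter) {
        const double x_new = 0.5 * (x + a / x);
        if (!std::isfinite(x_new)) {
            return kFailed; // iteration blew up
        }
        const bool converged = std::abs(x_new - x) <= 1.0e-12 * std::abs(x_new);
        x = x_new;
        if (converged) {
            return kSuccess;
        }
    }
    return kFailed; // did not converge within max_iter
}

// Stand-in for the kernel launch: each cell stores its own status.
inline std::vector<int> solve_all_cells(const std::vector<double> &rhs, int max_iter,
                                        std::vector<double> &solution)
{
    assert(max_iter > 0); // host-side precondition only
    solution.resize(rhs.size());
    std::vector<int> status(rhs.size(), kSuccess);
    for (std::size_t i = 0; i < rhs.size(); ++i) {
        status[i] = newton_sqrt(rhs[i], max_iter, solution[i]);
    }
    return status;
}

// Host-side check, analogous to taking the maximum over a status MultiFab.
inline bool any_cell_failed(const std::vector<int> &status)
{
    return !status.empty() &&
           *std::max_element(status.begin(), status.end()) != kSuccess;
}

// Host-side driver: the error is raised here, where aborting is reliable.
// Real code would call amrex::Abort on the host; throwing is a stand-in.
inline void update_or_throw(const std::vector<double> &rhs, int max_iter,
                            std::vector<double> &solution)
{
    if (any_cell_failed(solve_all_cells(rhs, max_iter, solution))) {
        throw std::runtime_error("Newton-Raphson solve failed in at least one cell");
    }
}
```

Instead of throwing, the caller could equally return the failure flag upward and redo the update with a smaller timestep.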

BenWibking · 2024-03-04T02:53:33Z

Another example of how to do this is how Castro handles reactions:
https://github.com/AMReX-Astro/Castro/blob/1f3bfff9739227bb61f4ac0123653061df102a03/Source/reactions/Castro_react.cpp#L408

It updates an atomic variable for each failed cell, rather than updating cells in a separate MultiFab, so it is significantly more memory efficient. This implementation should be preferred. With this kind of implementation, you have to remember to update across all MPI ranks after the atomic update. In Castro, this step is done here:
https://github.com/AMReX-Astro/Castro/blob/1f3bfff9739227bb61f4ac0123653061df102a03/Source/reactions/Castro_react.cpp#L427
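A rough sketch of the counter-based approach, again in plain C++ rather than the actual Castro code: `std::atomic` stands in for a device atomic, a loop stands in for the kernel, and a sum over a vector stands in for the MPI reduction. The names `burn_cell`, `count_failed_cells`, and `reduce_sum_over_ranks` are hypothetical, as is the failure criterion.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for a per-cell burn step that reports success through its return value.
// Treating a non-positive density as a failure is purely illustrative.
inline bool burn_cell(double rho) { return rho > 0.0; }

// Count failed cells in one shared counter instead of storing a status per cell.
// On a GPU the increment would be a device atomic; std::atomic stands in here.
inline int count_failed_cells(const std::vector<double> &rho)
{
    std::atomic<int> num_failed{0};
    for (std::size_t i = 0; i < rho.size(); ++i) {
        if (!burn_cell(rho[i])) {
            num_failed.fetch_add(1, std::memory_order_relaxed);
        }
    }
    return num_failed.load();
}

// Stand-in for the cross-rank step: each entry is one rank's local count.
// In an MPI code this would be a sum reduction so every rank sees the same total.
inline int reduce_sum_over_ranks(const std::vector<int> &per_rank_counts)
{
    int total = 0;
    for (int count : per_rank_counts) {
        assert(count >= 0); // a failure count can never be negative
        total += count;
    }
    return total;
}
```

Only one integer per rank is stored, which is where the memory saving over a per-cell status array comes from. The trade-off is that the counter says how many cells failed, not which ones.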

… chem and popiii problems) (#575)

### Description
As described in #543, `amrex::Abort()` does not unconditionally abort on GPUs. Following the workaround proposed in #543, I have implemented a way to circumvent this issue for problems using Microphysics' `burn` (PrimordialChem and PopIII).

### Related issues
#543

### Checklist
_Before this pull request can be reviewed, all of these tasks should be completed. Denote completed tasks with an `x` inside the square brackets `[ ]` in the Markdown source below:_
- [x] I have added a description (see above).
- [x] I have added a link to any related issues (see above).
- [x] I have read the [Contributing Guide](https://github.com/quokka-astro/quokka/blob/development/CONTRIBUTING.md).
- [ ] I have added tests for any new physics that this PR adds to the code.
- [ ] I have tested this PR on my local computer and all tests pass.
- [x] I have manually triggered the GPU tests with the magic comment `/azp run`.
- [x] I have requested a reviewer for this PR.

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Piyush Sharda <psharda@RSAA-043527.local>

psharda · 2024-04-19T14:18:25Z

Fixed for PopIII by #575

BenWibking added the bug, documentation, and priority:high labels Feb 23, 2024

BenWibking assigned BenWibking and unassigned BenWibking Feb 27, 2024

psharda mentioned this issue Mar 18, 2024

Fix amrex::abort not aborting unconditionally on GPUs (for primordial chem and popiii problems) #575

Merged

7 tasks

BenWibking mentioned this issue Jun 10, 2024

set Newton iteration maxIter to 50 #643

Open

7 tasks
