Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

amrex::Abort does not unconditionally abort on GPU #543

Open
BenWibking opened this issue Feb 23, 2024 · 4 comments
Open

amrex::Abort does not unconditionally abort on GPU #543

BenWibking opened this issue Feb 23, 2024 · 4 comments
Labels
bug Something isn't working documentation Improvements or additions to documentation priority:high high priority

Comments

@BenWibking
Copy link
Collaborator

BenWibking commented Feb 23, 2024

Describe the bug
We have been relying on amrex::Abort to unconditionally abort the code when called from GPU code. However, this only works when compiled in debug mode (or with additional compiler flags).

We need to rewrite the error handling code for iterative solves so that it correctly handles errors when -DCMAKE_BUILD_TYPE=Release is used (the default).

The Quokka documentation incorrectly describes amrex::Abort based on the old AMReX documentation (which was incorrect): https://quokka-astro.github.io/quokka/error_checking.html

To Reproduce
Steps to reproduce the behavior:

  1. Compile the PopIII problem for GPU
  2. It should crash with a VODE failure at timestep 58
  3. Observe that it does not.

Additional context
The AMReX documentation was recently changed (October 2023) to reflect this new behavior, but this change was not communicated to the user community: AMReX-Codes/amrex#3605.

This affects both the Newton-Raphson solve used by the radiation code as well as the Microphysics chemistry integrator.
cc @psharda @chongchonghe

@BenWibking BenWibking added bug Something isn't working documentation Improvements or additions to documentation priority:high high priority labels Feb 23, 2024
@BenWibking BenWibking assigned BenWibking and unassigned BenWibking Feb 27, 2024
@chongchonghe
Copy link
Contributor

This is important to know. I always reply on AMREX_ASSERT, AMREX_ALWAYS_ASSERT, or AMREX_ALWAYS_ASSERT_WITH_MESSAGE. I guess they all reply on amrex::Abort, right? Then, what is the solution? How do we "rewrite the error handling code" so that the code handles errors in release build? According to the documentation, we should set USE_ASSERTION=TRUE at compile time to enable AMREX_ALWAYS_ASSERT in GPU release build.

@BenWibking
Copy link
Collaborator Author

BenWibking commented Mar 4, 2024

This is important to know. I always reply on AMREX_ASSERT, AMREX_ALWAYS_ASSERT, or AMREX_ALWAYS_ASSERT_WITH_MESSAGE. I guess they all reply on amrex::Abort, right?

Yes.

Then, what is the solution? How do we "rewrite the error handling code" so that the code handles errors in release build?

You have to rewrite the function so it returns an error code that you then check. For an example, see the cooling code:

nsubsteps(i, j, k) = nsteps;

int nmax = nsubstepsMF.max(0);

You can then check the error condition in host code, and call amrex::Abort there. Or, you could return the failure condition to the calling function, and have it re-do the update with a smaller timestep. This is done in the hydro update.

@BenWibking
Copy link
Collaborator Author

BenWibking commented Mar 4, 2024

Another example of how to do this is how Castro handles reactions:
https://github.com/AMReX-Astro/Castro/blob/1f3bfff9739227bb61f4ac0123653061df102a03/Source/reactions/Castro_react.cpp#L408

It updates an atomic variable for each failed cell, rather than updating cells in a separate MultiFab, so it is significantly more memory efficient. This implementation should be preferred. With this kind of implementation, you have to remember to update across all MPI ranks after the atomic update. In Castro, this step is done here:
https://github.com/AMReX-Astro/Castro/blob/1f3bfff9739227bb61f4ac0123653061df102a03/Source/reactions/Castro_react.cpp#L427

github-merge-queue bot pushed a commit that referenced this issue Mar 22, 2024
… chem and popiii problems) (#575)

### Description
As described in #543 , `amrex::Abort()` does not unconditionally abort
on GPUs. Following the workaround proposed in #543, I have implemented a
way to circumvent this issue for problems using microphysics' `burn`
(PrimordialChem and PopIIII).

### Related issues
#543 

### Checklist
_Before this pull request can be reviewed, all of these tasks should be
completed. Denote completed tasks with an `x` inside the square brackets
`[ ]` in the Markdown source below:_
- [x] I have added a description (see above).
- [x] I have added a link to any related issues see (see above).
- [x] I have read the [Contributing
Guide](https://github.com/quokka-astro/quokka/blob/development/CONTRIBUTING.md).
- [ ] I have added tests for any new physics that this PR adds to the
code.
- [ ] I have tested this PR on my local computer and all tests pass.
- [x] I have manually triggered the GPU tests with the magic comment
`/azp run`.
- [x] I have requested a reviewer for this PR.

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Piyush Sharda <psharda@RSAA-043527.local>
@psharda
Copy link
Contributor

psharda commented Apr 19, 2024

Fixed for PopIII by #575

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working documentation Improvements or additions to documentation priority:high high priority
Projects
None yet
Development

No branches or pull requests

3 participants