p8est needs two balance calls in function balance! #2289

Open
amrueda wants to merge 10 commits into main from arr/p8est_balance

@amrueda
Contributor

@amrueda amrueda commented Feb 17, 2025

As in 2D P4est meshes, p8est_balance sometimes needs to be called twice to ensure the mesh is properly balanced. This issue is related to the P4est bug reported in p4est issue #112.

If the second balancing step is skipped, some simulations may crash during the coarsening operation. I identified a minimal working example by modifying the elixir examples/p4est_3d_dgsem/elixir_advection_amr.jl, which consistently crashes without the fix introduced in this PR.
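As a purely illustrative aside on why a single balancing pass can leave work undone, here is a toy one-dimensional sketch. It is not p4est's algorithm and makes no claim about the cause of the behavior described above; `balance_pass!`, `is_balanced`, and the level vector are hypothetical names invented for this example.

```julia
# Toy 1D model: `levels[i]` is the refinement level of cell i.
# "Balanced" here means that neighboring levels differ by at most one.

# One left-to-right sweep that raises the level of the coarser neighbor
# wherever two adjacent cells differ by more than one level.
# Returns true if any level was changed.
function balance_pass!(levels::Vector{Int})
    changed = false
    for i in 1:(length(levels) - 1)
        if levels[i + 1] > levels[i] + 1
            levels[i] = levels[i + 1] - 1
            changed = true
        elseif levels[i] > levels[i + 1] + 1
            levels[i + 1] = levels[i] - 1
            changed = true
        end
    end
    return changed
end

is_balanced(levels) = all(abs(levels[i] - levels[i + 1]) <= 1
                          for i in 1:(length(levels) - 1))

levels = [0, 0, 3]
balance_pass!(levels)                    # first pass
balanced_after_one = is_balanced(levels)
balance_pass!(levels)                    # second pass
balanced_after_two = is_balanced(levels)
```

In this toy, the change made at the right end during the first sweep creates a new imbalance further left, which only the second sweep picks up.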

Issue Breakdown for elixir_advection_amr.jl

  • After time step 160, some cells are marked for refinement. The refinement and an initial balancing step are performed.
  • Next, coarsen!(mesh) is called (see code below). The coarsening operation itself (line 2319, the coarsen_p4est! call) completes without problems.
  • In the subsequent balance!() call (line 2335), some changes left unfinished by the previous balancing step (the one performed after refinement) take effect.
  • This leads to a crash when the code attempts to locate a cell that was modified during balancing and was also marked for coarsening (line 2346, the findfirst lookup). Since no matching cell is found, findfirst returns nothing.
  • The crash occurs at line 2349, where nothing is mistakenly used as an integer index into coarsened_cells.

function coarsen!(mesh::P4estMesh)
    # Copy original element IDs to quad user data storage
    original_n_cells = ncells(mesh)
    save_original_ids(mesh)
    # Coarsen marked cells
    coarsen_fn_c = cfunction(coarsen_fn, Val(ndims(mesh)))
    init_fn_c = cfunction(init_fn, Val(ndims(mesh)))
    @trixi_timeit timer() "coarsen!" coarsen_p4est!(mesh.p4est, false, coarsen_fn_c,
                                                    init_fn_c)
    # IDs of newly created cells (one-based)
    new_cells = collect_new_cells(mesh)
    # Old IDs of cells that have been coarsened (one-based)
    coarsened_cells_vec = collect_changed_cells(mesh, original_n_cells)
    # 2^ndims changed cells should have been coarsened to one new cell.
    # This matrix will store the IDs of all cells that have been coarsened to cell new_cells[i]
    # in the i-th column.
    coarsened_cells = reshape(coarsened_cells_vec, 2^ndims(mesh), length(new_cells))
    # Save new original IDs to find out what changed after balancing
    intermediate_n_cells = ncells(mesh)
    save_original_ids(mesh)
    @trixi_timeit timer() "rebalance" balance!(mesh, init_fn_c)
    refined_cells = collect_changed_cells(mesh, intermediate_n_cells)
    # Some cells may have been coarsened even though they unbalanced the forest.
    # These cells have now been refined again by p4est_balance.
    # refined_cells contains the intermediate IDs (ID of coarse cell
    # between coarsening and balancing) of these cells.
    # Find original ID of each cell that has been coarsened and then refined again.
    for refined_cell in refined_cells
        # i-th cell of the ones that have been created by coarsening has been refined again
        i = findfirst(==(refined_cell), new_cells)
        # Remove IDs of the 2^ndims cells that have been coarsened to this cell
        coarsened_cells[:, i] .= -1
    end
    # Return all IDs of cells that have been coarsened but not refined again by balancing
    return coarsened_cells_vec[coarsened_cells_vec .>= 0]
end
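
The failing lookup in the loop above can be reproduced in isolation with plain arrays. The following is a minimal sketch with made-up cell IDs (a 2D layout with 4 children per coarsened cell is assumed for brevity); it does not use Trixi.jl or p4est, and the data are hypothetical.

```julia
# Hypothetical stand-in data for a 2D forest (2^2 = 4 children per coarsened cell)
new_cells = [3, 7, 12]                   # IDs of cells created by coarsening
coarsened_cells_vec = collect(1:12)      # old IDs of the coarsened cells
coarsened_cells = reshape(coarsened_cells_vec, 4, length(new_cells))

# Case 1: the refined cell is one of the new cells, so the lookup succeeds
i = findfirst(==(7), new_cells)
coarsened_cells[:, i] .= -1              # the reshaped matrix shares memory with the vector
remaining = coarsened_cells_vec[coarsened_cells_vec .>= 0]

# Case 2: the refined cell is NOT among the new cells
j = findfirst(==(5), new_cells)          # no match, so `findfirst` returns `nothing`
lookup_failed = isnothing(j)
indexing_threw = try
    coarsened_cells[:, j] .= -1          # `nothing` is not a valid index
    false
catch
    true
end
```

Case 2 mirrors the situation described in the breakdown: a cell changed by the later balance! call that does not correspond to any entry of new_cells.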

@amrueda amrueda marked this pull request as draft February 17, 2025 11:10
@github-actions
Contributor

Review checklist

This checklist is meant to assist creators of PRs (to let them know what reviewers will typically look for) and reviewers (to guide them in a structured review process). Items do not need to be checked explicitly for a PR to be eligible for merging.

Purpose and scope

  • The PR has a single goal that is clear from the PR title and/or description.
  • All code changes represent a single set of modifications that logically belong together.
  • No more than 500 lines of code are changed or there is no obvious way to split the PR into multiple PRs.

Code quality

  • The code can be understood easily.
  • Newly introduced names for variables etc. are self-descriptive and consistent with existing naming conventions.
  • There are no redundancies that can be removed by simple modularization/refactoring.
  • There are no leftover debug statements or commented code sections.
  • The code adheres to our conventions and style guide, and to the Julia guidelines.

Documentation

  • New functions and types are documented with a docstring or top-level comment.
  • Relevant publications are referenced in docstrings (see example for formatting).
  • Inline comments are used to document longer or unusual code sections.
  • Comments describe intent ("why?") and not just functionality ("what?").
  • If the PR introduces a significant change or new feature, it is documented in NEWS.md with its PR number.

Testing

  • The PR passes all tests.
  • New or modified lines of code are covered by tests.
  • New or modified tests run in less than 10 seconds.

Performance

  • There are no type instabilities or memory allocations in performance-critical parts.
  • If the PR intent is to improve performance, before/after time measurements are posted in the PR.

Verification

  • The correctness of the code was verified using appropriate tests.
  • If new equations/methods are added, a convergence test has been run and the results
    are posted in the PR.

Created with ❤️ by the Trixi.jl community.

@codecov

codecov bot commented Feb 17, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 95.81%. Comparing base (dc2966b) to head (69ac521).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2289      +/-   ##
==========================================
- Coverage   97.00%   95.81%   -1.18%     
==========================================
  Files         555      555              
  Lines       43723    43723              
==========================================
- Hits        42410    41893     -517     
- Misses       1313     1830     +517     
Flag Coverage Δ
unittests 95.81% <100.00%> (-1.18%) ⬇️
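
The percentages in the diff above follow from the hit and line counts. A short sketch of the arithmetic, assuming coverage is simply hits divided by lines (the helper `coverage` is a name made up for this example):

```julia
# Counts taken from the coverage diff above
lines = 43723
hits_base, hits_head = 42410, 41893

# Assumed definition: coverage in percent = 100 * hits / lines
coverage(hits, lines) = 100 * hits / lines

cov_base = coverage(hits_base, lines)
cov_head = coverage(hits_head, lines)
delta = cov_head - cov_base

# Misses are the lines that were not hit
misses_base = lines - hits_base
misses_head = lines - hits_head

println(round(cov_base, digits = 2), " ", round(cov_head, digits = 2), " ",
        round(delta, digits = 2))
```

Since the line count is unchanged, the 517 lost hits show up one-to-one as 517 additional misses.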

@amrueda amrueda changed the title from "WIP: p8est needs two balance calls in function balance!" to "p8est needs two balance calls in function balance!" Feb 17, 2025
@amrueda amrueda marked this pull request as ready for review February 17, 2025 14:03
Member

@ranocha ranocha left a comment

Thanks!

Member

@ranocha ranocha left a comment

Something weird is going on with the MPI tests in this PR but not on main. Could you please check what is going on?

@Arpit-Babbar
Member

MPI applications being stuck waiting for other ranks is an MPI deadlock. I guess it can happen if one MPI rank goes to the second balance! call before all ranks finish the first, although P4est should not allow that.

If that is what is causing it, then putting an MPI.Waitall there should fix it.

What can also cause it is if the balance! function is just not able to succeed and is thus stuck.

What is strange is that the currently running CI seems to be stuck at the end of the simulation, in fact while printing the summary callback; see the screenshots below.

[Two screenshots of the CI run; images not preserved]

These screenshots contradict my guess about the deadlock, but I thought I would mention it here anyway.

@sloede
Member

sloede commented Feb 19, 2025

> What can also cause it is if the balance! function is just not able to succeed and is thus stuck.

I think given p4est's maturity, this is not a very likely scenario unless you are trying to process a very weird mesh.

@DanielDoehring DanielDoehring added the bug Something isn't working label Feb 20, 2025
@cburstedde

It is possible that the reported behavior is not a bug of p4est but rather a consequence of an incompletely specified connectivity. Copying the latest convention from the quoted p4est issue:

Connectivity completeness: If a 3D connectivity contains natural connections between trees that are edge neighbors without being face neighbors, these edges shall be encoded explicitly in the connectivity structure. If a connectivity implies natural connections between trees that are corner neighbors without being face (or edge) neighbors, these corners shall be encoded explicitly.

Are we certain that the connectivity is constructed according to these rules? Unfortunately, we do not have a function in p4est that would provide a simple answer to this question.
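
To make the quoted convention concrete, here is a small sketch that classifies how two trees touch. It rests on assumptions that are not from this thread: the trees are idealized as unit cubes at integer lattice positions, and the convention is read literally as "pairs that share only an edge or only a corner need an explicit entry". The function `neighbor_kind` and the example layout are hypothetical and do not use the p4est API.

```julia
# Classify how two unit-cube trees at integer lattice positions touch.
# Returns :same, :face, :edge, :corner, or :none.
function neighbor_kind(a::NTuple{3, Int}, b::NTuple{3, Int})
    offsets = abs.(a .- b)
    any(>(1), offsets) && return :none   # too far apart to touch
    n = count(==(1), offsets)            # number of directions in which they differ
    return (:same, :face, :edge, :corner)[n + 1]
end

# Hypothetical three-tree layout
trees = [(0, 0, 0), (1, 1, 0), (1, 1, 1)]

# Pairs that touch only along an edge or only at a corner
pairs_needing_explicit_entry = [(a, b, neighbor_kind(a, b))
                                for (k, a) in enumerate(trees)
                                for b in trees[(k + 1):end]
                                if neighbor_kind(a, b) in (:edge, :corner)]
```

For a real connectivity, the equivalent check would have to work on the shared vertices of the trees rather than on lattice positions.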

@sloede
Member

sloede commented Mar 25, 2025

> Are we certain that the connectivity is constructed according to these rules?

@Arpit-Babbar @amrueda Can you confirm this?

@amrueda
Contributor Author

amrueda commented Mar 25, 2025

> It is possible that the reported behavior is not a bug of p4est but rather a consequence of an incompletely specified connectivity. Copying the latest convention from the quoted p4est issue:
>
> > Connectivity completeness: If a 3D connectivity contains natural connections between trees that are edge neighbors without being face neighbors, these edges shall be encoded explicitly in the connectivity structure. If a connectivity implies natural connections between trees that are corner neighbors without being face (or edge) neighbors, these corners shall be encoded explicitly.
>
> Are we certain that the connectivity is constructed according to these rules? Unfortunately, we do not have a function in p4est that would provide a simple answer to this question.

Many thanks for pointing this out, @cburstedde!
We will check this out!

6 participants