adap branch UCX error #1156
Comments
I think that can only mean the mesh is corrupted, which is causing memory errors within parmetis. |
Hi @pcarruscag thanks for the reply. So you think it might be a mesh issue? It might well be possible, as I previously had a structured collar mesh around an unstructured core for supersonic evaluation but I tried to diagonalize the collar mesh as it seemed that AMG refinement only works for triangles and tetrahedrons... I could have made a mistake in this step! I'll take another look. Cheers. |
Can you run it with valgrind to check if there is a memory issue? Compile with -g. Also, does the problem persist if you reduce the number of MPI ranks? |
Hi @vdweide can I just double-check what I should try? Compiling SU2 with |
Compile with -g or, when using meson, just add --buildtype=debug to the arguments to build the executable. Then run it as follows: mpirun -np 40 valgrind SU2_CFD case.cfg. You will probably get quite a few false warnings from MPI, but you can filter those out. Try to reduce the number of ranks, if possible. |
Ok, I've recompiled using --buildtype=debug and I'm running valgrind now. I'll try and run it at a reduced rank and get back to you. Thanks. |
@vdweide I've attached the SU2 output and the valgrind output running on 2 processes, i.e.: I also tried with 30 processes but valgrind gave up after stating that there were too many errors. Sorry, I'm not so familiar with what to look out for. I'm guessing that something showing in the leak summary is a bad thing? Thanks |
In case it helps, I also ran valgrind using |
No, the invalid reads and writes are problematic. There you cross the boundaries of allocated memory and anything can happen. |
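As a rough aid for reading a long valgrind log, the sketch below counts only the "Invalid read" and "Invalid write" reports, which are the entries flagged as problematic above. It assumes the standard Memcheck line prefix (`==PID== Invalid read of size N`); the function name `summarize_invalid_accesses` and the log file name in the usage comment are hypothetical, not part of SU2 or valgrind.

```python
import re
from collections import Counter

# Memcheck prefixes each report line with "==<pid>==".
# Only invalid reads/writes are matched; leak-summary lines are ignored.
INVALID_ACCESS = re.compile(r"^==(\d+)==\s+(Invalid (?:read|write)) of size (\d+)")


def summarize_invalid_accesses(lines):
    """Count invalid reads and writes per process ID in valgrind output."""
    counts = Counter()
    for line in lines:
        match = INVALID_ACCESS.match(line)
        if match:
            pid, kind = match.group(1), match.group(2)
            counts[(pid, kind)] += 1
    return counts


# Usage (hypothetical file name):
#   with open("valgrind_output.txt") as handle:
#       for (pid, kind), n in sorted(summarize_invalid_accesses(handle).items()):
#           print(pid, kind, n)
```

Grouping by process ID keeps the ranks apart when several MPI processes write to the same log.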
I'm using the 'feature_adap' branch. At least, I believe I am; I pulled the repo in this manner:
As far as I can tell, it's v7.0.3. |
That's indeed how you get the feature_adap branch. |
I had a quick look at the merging process and it seems like quite a few files conflict. I'm not sure which files I can merge from develop and not accidentally break the feature_adap functionality. Can I more or less pull across most of these changes? I can give it a go if you can give me some pointers but I'm not well-versed in cpp! Thanks.
|
No, you cannot just do that. Somebody who worked on feature_adap should have a look at it. @bmunguia, it looks like you made the latest commit to this branch, but that is already quite some time ago (May 2020). What is the current status and do you plan to merge with the latest version of develop? |
I see, I hope that @bmunguia will have a chance to take a look! I tried to have a look through the past commits but I didn't manage to successfully merge all the functions. From what I can tell, these are the edited variables/functions:
CConfig Class
Cvertex (not sure if values should be initialised)
Option_structure
CPhysicalGeometry - probably AMG stuff?
Common/src/geometry/dual_grid/CPoint.cpp
COutputLegacy.hpp
output_structure_legacy.cpp
Csolver
meson.build
Init.py - add amgio stuff
Preconfigure.py
|
I'm unsure if the AMG version uses its own implementation of vertices etc. or if these happen to be the way that they were implemented in older versions of SU2. |
Most likely a mixture of those 2 things, but it should not be too difficult to fix. |
@pcarruscag Could you help me take a look through or give me some pointers on where to start? I've not programmed in C++ before (maybe a good time to start with lockdown...) but if it's a case of figuring out how to merge already working code, I might be able to hack something together. To be honest, though, it might be better/faster for someone who actually knows what they're doing to do so! I wanted to use this functionality as part of another project, so I'm just wary of breaking something not obvious in the background. |
I could but I do not think updating that branch will fix your problem. We have not found any mesh handling bugs recently. |
@pcarruscag I see, sorry I didn't realise that it still might be a mesh problem - I thought it was a memory issue from the error messages! Ok, I'll give it another try from scratch. If I understand correctly, amg only works with triangles and tetrahedrons, not pyramids or quads, is that right? Thanks again. |
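If the assumption in the question above holds (that AMG refinement accepts only triangles and tetrahedra, which this thread does not confirm), a quick way to check a mesh is to count the element types in its NELEM section. The sketch below assumes the native SU2 ASCII mesh format, where each element line starts with a VTK type code (5 triangle, 9 quadrilateral, 10 tetrahedron, 12 hexahedron, 13 prism, 14 pyramid). The function names are hypothetical helpers, not part of SU2.

```python
from collections import Counter

# VTK element codes used by the SU2 native mesh format.
VTK_NAMES = {
    3: "line", 5: "triangle", 9: "quadrilateral",
    10: "tetrahedron", 12: "hexahedron", 13: "prism", 14: "pyramid",
}
SIMPLEX_TYPES = {5, 10}  # triangles and tetrahedra


def count_volume_elements(lines):
    """Count element types listed under each NELEM= section."""
    counts = Counter()
    remaining = 0  # element lines still expected in the current section
    for raw in lines:
        line = raw.split("%", 1)[0].strip()  # drop SU2 '%' comments
        if not line:
            continue
        if remaining == 0:
            if line.upper().startswith("NELEM="):
                remaining = int(line.split("=", 1)[1])
            continue
        counts[int(line.split()[0])] += 1
        remaining -= 1
    return counts


def non_simplex(counts):
    """Return only the element types that are not triangles or tetrahedra."""
    return {t: n for t, n in counts.items() if t not in SIMPLEX_TYPES}


# Usage (hypothetical file name):
#   with open("mesh.su2") as handle:
#       leftover = non_simplex(count_volume_elements(handle))
#   print({VTK_NAMES.get(t, t): n for t, n in leftover.items()})
```

An empty result from `non_simplex` means the volume mesh contains only triangles and tetrahedra; boundary marker elements are not inspected.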
Hi @pcarruscag I just tried a simpler mesh and using MPI I get the UCX crash. To double check, I also used the master v7.0.8 SU2_CFD. When I run with MPI, I get the UCX error but when I run in serial, the solution appears to converge fine. I suspect that this means it's probably not the mesh that is causing the issues - what are your thoughts? |
I looked for "UCX error" and got e.g. this openucx/ucx#4742 |
Interesting - my MPI is straight from the CentOS repo, so I didn't expect it to be the issue but I'll try to compile another version just to check. |
After pulling in the latest OpenMPI v3 (3.1.6) and recompiling the mpi4py and SU2 branch, this error seems to have gone away! Thank you for your help @pcarruscag @vdweide |
Hi, I'm opening a new thread since it seems that this issue isn't directly related to the AMG mesh refinement itself, but feel free to close or move this to a more appropriate place @pcarruscag
I'm having an issue when running SU2_CFD in the feature_adap branch (so this means that it also fails when trying to run the mesh refinement script). It seems to run fine for the TestCase/euler/naca0012 but when I try it on my mesh I get a UCX ERROR.
On running:
mpirun -n 40 --use-hwthread-cpus /opt/su2/SU2v7_adap/bin/SU2_CFD test.cfg
I seem to get variations on this message in my screen output:
Sometimes it hangs at the UCX ERROR lines straight after "Building the graph adjacency structure." in Geometry Preprocessing and other times, it seems to run fine for the first batch of iterations until it hits the first solution file writing iteration (as set by OUTPUT_WRT_FREQ) shown in the above output snippet.
Do you have any hints on how to debug this or what might be causing this? Thanks.
To Reproduce
I've attached the mesh and config file in this link.