adap branch UCX error #1156

Closed
timjim333 opened this issue Jan 6, 2021 · 23 comments

@timjim333

Hi, I'm opening a new thread since it seems that this issue isn't directly related to the AMG mesh refinement itself, but feel free to close or move this to a more appropriate place, @pcarruscag.

I'm having an issue when running SU2_CFD from the feature_adap branch (which means it also fails when trying to run the mesh refinement script). It runs fine for the TestCase/euler/naca0012 case, but when I try it on my mesh I get a UCX ERROR.

When running mpirun -n 40 --use-hwthread-cpus /opt/su2/SU2v7_adap/bin/SU2_CFD test.cfg, I get variations on this message in my screen output:

|          49|   -2.095057|    0.015781|    0.001431|    0.000000|  9.1667e+04|
|          50|   -2.140503|    0.015781|    0.001431|    0.000000|  9.1667e+04|
+-----------------------------------------------------------------------+
|        File Writing Summary       |              Filename             |
+-----------------------------------------------------------------------+
|SU2 restart                        |restart_flow.dat                   |
|Paraview binary                    |flow.vtk                           |
|Paraview binary surface            |surface_flow.vtk                   |
[1609922278.175246] [super:1134625:0]           sock.c:344  UCX  ERROR recv(fd=56) failed: Bad address
[1609922278.175301] [super:1134625:0]           sock.c:344  UCX  ERROR recv(fd=54) failed: Connection reset by peer
[1609922278.175551] [super:1134625:0]           sock.c:344  UCX  ERROR sendv(fd=-1) failed: Bad file descriptor

SU2_CFD: ../externals/parmetis/libparmetis/match.c:243: libparmetis__Match_Global: Assertion `k >= firstvtx && k < lastvtx' failed.
[super:1134138] *** Process received signal ***
[super:1134138] Signal: Aborted (6)
[super:1134138] Signal code:  (-6)
[super:1134138] [ 0] /lib64/libpthread.so.0(+0x12b20)[0x7fb93d021b20]
[super:1134138] [ 1] /lib64/libc.so.6(gsignal+0x10f)[0x7fb93c1507ff]
[super:1134138] [ 2] /lib64/libc.so.6(abort+0x127)[0x7fb93c13ac35]
[super:1134138] [ 3] /lib64/libc.so.6(+0x21b09)[0x7fb93c13ab09]
[super:1134138] [ 4] /lib64/libc.so.6(+0x2fde6)[0x7fb93c148de6]
[super:1134138] [ 5] /opt/su2/SU2v7_adap/bin/SU2_CFD[0x1a9be03]
[super:1134138] [ 6] /opt/su2/SU2v7_adap/bin/SU2_CFD[0x1a94e76]
[super:1134138] [ 7] /opt/su2/SU2v7_adap/bin/SU2_CFD[0x1a9590d]
[super:1134138] [ 8] /opt/su2/SU2v7_adap/bin/SU2_CFD[0xabb1bb]
[super:1134138] [ 9] /opt/su2/SU2v7_adap/bin/SU2_CFD[0x7ddf6b]
[super:1134138] [10] /opt/su2/SU2v7_adap/bin/SU2_CFD[0x7ded07]
[super:1134138] [11] /opt/su2/SU2v7_adap/bin/SU2_CFD[0x7df356]
[super:1134138] [12] /opt/su2/SU2v7_adap/bin/SU2_CFD[0x7e445f]
[super:1134138] [13] /opt/su2/SU2v7_adap/bin/SU2_CFD[0x45ba61]
[super:1134138] [14] /lib64/libc.so.6(__libc_start_main+0xf3)[0x7fb93c13c7b3]
[super:1134138] [15] /opt/su2/SU2v7_adap/bin/SU2_CFD[0x47216e]
[super:1134138] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 38 with PID 0 on node super exited on signal 6 (Aborted).
--------------------------------------------------------------------------

Sometimes it fails with the UCX ERROR lines straight after "Building the graph adjacency structure." in Geometry Preprocessing; other times it runs fine for the first batch of iterations until it hits the first solution file writing iteration (as set by OUTPUT_WRT_FREQ), as shown in the output snippet above.

Do you have any hints on how to debug this or what might be causing this? Thanks.

To Reproduce
I've attached the mesh and config file in this link.

Desktop (please complete the following information):

  • OS: Linux CentOS 8
  • C++ compiler and version: GCC 8.22
  • MPI implementation and version: Intel (Open MPI) 3.0.0
  • SU2 Version: v7.0.3 (feature_adap branch)
@timjim333 timjim333 added the bug label Jan 6, 2021
@pcarruscag
Member

I think that can only mean the mesh is corrupted, which is causing memory errors within ParMETIS.
Memory errors can take some time to manifest, especially in small cases.
If the case is small, you can try running the serial version to see whether the problem only occurs in parallel. As for the root cause of the bad mesh, I have no idea.
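
If you want to try that, a minimal sketch assuming a meson-based build (the build directory and install prefix are illustrative):

# build a serial (no-MPI) copy of the branch and run the case without mpirun
./meson.py build_serial -Dwith-mpi=disabled --prefix=/opt/su2/SU2v7_adap_serial
./ninja -C build_serial install
/opt/su2/SU2v7_adap_serial/bin/SU2_CFD test.cfg

If the serial run is clean, that points at the parallel partitioning/communication path rather than the solver itself.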

@timjim333
Author

Hi @pcarruscag, thanks for the reply. So you think it might be a mesh issue? That may well be possible: I previously had a structured collar mesh around an unstructured core for supersonic evaluation, but I tried to diagonalize the collar mesh since it seemed that AMG refinement only works for triangles and tetrahedra... I could have made a mistake in that step! I'll take another look. Cheers.

@vdweide
Contributor

vdweide commented Jan 6, 2021

Can you run it with valgrind to check if there is a memory issue? Compile with -g. Also, does the problem persist if you reduce the number of MPI ranks?

@timjim333
Author

Hi @vdweide, can I just double-check what I should try? Compiling SU2 with -g, or using valgrind? Thanks!

@vdweide
Contributor

vdweide commented Jan 6, 2021

Compile with -g, or when using meson just add --buildtype=debug to the build arguments. Then run it as follows:

mpirun -np 40 valgrind SU2_CFD case.cfg

You'll probably get quite a few false warnings from MPI, but you can filter those out. Try to reduce the number of ranks, if possible.
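
A sketch of the full workflow, assuming a meson build (install prefix and log-file names are illustrative):

# configure and build a debug binary
./meson.py build -Dwith-mpi=enabled --buildtype=debug --prefix=/opt/su2/SU2v7_adap_debug
./ninja -C build install

# one valgrind log per MPI rank; %p expands to the process ID
mpirun -np 4 valgrind --log-file=vg.%p.log /opt/su2/SU2v7_adap_debug/bin/SU2_CFD case.cfg

# to filter the MPI false positives, add --gen-suppressions=all, paste the
# MPI-internal suppression blocks into a file (e.g. mpi.supp), and re-run
# with --suppressions=mpi.supp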

@timjim333
Author

OK, I've recompiled using --buildtype=debug and I'm running valgrind now. I'll try to run it with a reduced number of ranks and get back to you. Thanks.

@timjim333
Author

@vdweide I've attached the SU2 output and the valgrind output from running on 2 processes, i.e.: mpirun -n 2 --use-hwthread-cpus valgrind /opt/su2/SU2v7_adap/bin/SU2_CFD test.cfg
su2_out_2.txt
valgrind_out_2.txt

I also tried with 30 processes but valgrind gave up after stating that there were too many errors.

Sorry, I'm not so familiar with what to look out for. I'm guessing that anything showing up in the leak summary is a bad sign? Thanks.

@timjim333
Author

In case it helps, I also ran valgrind using --leak-check=full and --track-origins=yes. I've attached the outputs here.
valgrind_out_2_leakcheck.txt
valgrind_out_2_origins.txt

@vdweide
Contributor

vdweide commented Jan 6, 2021

No, the invalid reads and writes are problematic. They mean you are crossing the boundaries of allocated memory, and then anything can happen.
What version/branch are you using? The line numbers valgrind gives do not correspond to the current develop version.

@timjim333
Author

I'm using the 'feature_adap' branch. At least, I believe I am; I pulled the repo in this manner:

git clone https://github.com/su2code/SU2.git SU2_src
cd SU2_src
git checkout feature_adap

As far as I can tell, it's v7.0.3.
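
A quick way to confirm which branch and commit a checkout is on (standard git commands):

cd SU2_src
git rev-parse --abbrev-ref HEAD   # prints the checked-out branch, e.g. feature_adap
git log -1 --oneline              # latest commit on that branch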

@vdweide
Contributor

vdweide commented Jan 6, 2021

That's indeed how you get the feature_adap branch.
Is it possible to merge this branch with the latest version of develop first?

@timjim333
Author

I had a quick look at the merging process and it seems that quite a few files conflict (listed below; the sketch after the list shows how such a list can be reproduced). I'm not sure which files I can merge from develop without accidentally breaking the feature_adap functionality. Can I more or less pull across most of these changes? I can give it a go if you can give me some pointers, but I'm not well-versed in C++! Thanks.

Common/include/CConfig.hpp
Common/include/adt/CADTElemClass.hpp
Common/include/geometry/dual_grid/CEdge.hpp
Common/include/geometry/dual_grid/CPoint.hpp
Common/include/geometry/dual_grid/CVertex.hpp
Common/include/option_structure.hpp
Common/src/adt/CADTElemClass.cpp
Common/src/geometry/CPhysicalGeometry.cpp
Common/src/geometry/dual_grid/CPoint.cpp
SU2_CFD/include/output/COutputLegacy.hpp
SU2_CFD/include/solvers/CEulerSolver.hpp
SU2_CFD/include/solvers/CSolver.hpp
SU2_CFD/src/iteration_structure.cpp
SU2_CFD/src/numerics/flow/flow_diffusion.cpp
SU2_CFD/src/output/CFlowCompOutput.cpp
SU2_CFD/src/output/output_structure_legacy.cpp
SU2_CFD/src/solvers/CEulerSolver.cpp
SU2_CFD/src/solvers/CNSSolver.cpp
SU2_CFD/src/solvers/CSolver.cpp
SU2_CFD/src/solvers/CTurbSASolver.cpp
SU2_CFD/src/solvers/CTurbSSTSolver.cpp
SU2_CFD/src/solvers/CTurbSolver.cpp
SU2_CFD/src/variables/CEulerVariable.cpp
SU2_DOT/src/meson.build
SU2_IDE/Xcode/SU2_CFD.xcodeproj/project.pbxproj
SU2_PY/pySU2/pySU2.i
SU2_PY/pySU2/pySU2ad.i
meson_scripts/init.py
preconfigure.py
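
For reference, a conflict list like the one above can be reproduced with a dry-run merge (standard git commands; branch names as used in this thread):

# dry-run merge: list the conflicting (unmerged) files without committing anything
git fetch origin
git checkout feature_adap
git merge --no-commit --no-ff origin/develop
git diff --name-only --diff-filter=U
git merge --abort    # back out of the trial merge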

@vdweide
Contributor

vdweide commented Jan 7, 2021

No, you cannot just do that. Somebody who worked on feature_adap should have a look at it. @bmunguia, it looks like you made the latest commit to this branch, but that is already quite some time ago (May 2020). What is the current status and do you plan to merge with the latest version of develop?

@timjim333
Author

I see; I hope @bmunguia will have a chance to take a look! I tried looking through the past commits, but I didn't manage to merge all the functions successfully. From what I can tell, these are the edited variables/functions:

CConfig Class

  • WRT_SLICE
  • GetBool_Compute_Metric
  • GetWrt_Aniso_Sensor
  • GetKind_Aniso_Sensor
  • GetKind_Hessian_Method
  • GetAdap_Norm
  • GetAdap_Hmax
  • GetAdap_Hmin
  • GetAdap_ARmax
  • GetAdap_Complexity

CVertex (not sure if values should be initialised)

  • ~CVertex
  • GetnDonorPoints
  • SetDonorCoeff
  • GetDonorCoeff
  • SetInterpDonorPoint
  • GetInterpDonorPoint
  • SetInterpDonorProcessor
  • GetInterpDonorProcessor
  • Allocate_DonorInfo
  • GetVarRot
  • SetVarRot

option_structure.hpp

  • ENUM_ANISO_SENSOR
  • MapType
  • ENUM_OUTPUT
  • MakePair("INRIA", INRIA)
  • MPI_QUANTIFIES enums

CPhysicalGeometry - probably AMG stuff?

  • CPhysicalGeometry::LoadAdaptedMeshParallel_FVM
  • CPhysicalGeometry::Check_IntElem_Orientation
  • CPhysicalGeometry::Check_BoundElem_Orientation

Common/src/geometry/dual_grid/CPoint.cpp

  • Check this one!

COutputLegacy.hpp

  • Import inria amg

output_structure_legacy.cpp

  • SpecialOutput - Inria methods?

CSolver

  • CSolver::SetPositiveDefiniteHessian

meson.build

  • Add 'output/filewriter/CInriaFileWriter.cpp',

init.py - add amgio stuff

  • Add sha_version_amg
  • Add github_repo_amg

preconfigure.py

  • Add init_inria
  • Add other inria flags

@timjim333
Author

I'm unsure whether the AMG version uses its own implementation of vertices etc., or if these happen to be the way they were implemented in older versions of SU2.

@pcarruscag
Member

Most likely a mixture of those two things, but it should not be too difficult to fix.

@timjim333
Author

@pcarruscag Could you help me take a look through, or give me some pointers on where to start? I've not programmed in C++ before (maybe a good time to start, with lockdown...), but if it's a case of figuring out how to merge already-working code, I might be able to hack something together. To be honest, though, it might be better/faster for someone who actually knows what they're doing to do it!

I wanted to use this functionality as part of another project, so I'm just wary of breaking something not obvious in the background.

@pcarruscag
Member

I could, but I do not think updating that branch will fix your problem. We have not found any mesh handling bugs recently.
Creating or modifying meshes manually can get tricky (at least in my experience).
Have you tried simpler problems? Start with a problem that is known to work (there is a long issue with success stories; search for mesh adaptation here on GitHub). Then build up from it, e.g. take the same problem and use a finer grid, change the physics to what you need, then use a grid for your problem (ideally changing one thing at a time).
Also keep in mind that if that branch were finished work, it would probably have been merged into develop by now...

@timjim333
Author

@pcarruscag I see - sorry, I didn't realise it might still be a mesh problem; I thought it was a memory issue from the error messages! OK, I'll give it another try from scratch. If I understand correctly, AMG only works with triangles and tetrahedra, not pyramids or quads - is that right? Thanks again.

@timjim333
Author

Hi @pcarruscag, I just tried a simpler mesh, and with MPI I still get the UCX crash.
err_log_SU2v7.0.3.txt

To double-check, I also used the master v7.0.8 SU2_CFD. When I run with MPI I get the UCX error, but when I run in serial the solution appears to converge fine. I suspect this means it's probably not the mesh causing the issues - what are your thoughts?
su2_out_serial.txt
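
For clarity, the two runs were along these lines (the v7.0.8 binary path is illustrative):

# parallel run: fails with UCX errors
mpirun -n 40 --use-hwthread-cpus /opt/su2/SU2v7.0.8/bin/SU2_CFD test.cfg

# serial run of the same case: converges fine (same binary, no mpirun)
/opt/su2/SU2v7.0.8/bin/SU2_CFD test.cfg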

@pcarruscag
Member

I searched for "UCX error" and found, e.g., this: openucx/ucx#4742
I don't know for sure, but it looks like an MPI configuration problem...
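
One way to test that hypothesis (my suggestion, not something verified for this setup) is to tell Open MPI to bypass UCX and use its plain TCP/shared-memory transports via MCA parameters:

# force the ob1 PML with tcp/self/vader BTLs instead of UCX
mpirun -n 40 --mca pml ob1 --mca btl tcp,self,vader /opt/su2/SU2v7_adap/bin/SU2_CFD test.cfg

If the crash disappears with UCX out of the picture, the UCX/MPI installation is the likely culprit.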

@timjim333
Author

Interesting - my MPI is straight from the CentOS repo, so I didn't expect it to be the issue, but I'll try to compile another version just to check.

@timjim333
Author

After pulling in the latest Open MPI v3 (3.1.6) and recompiling both mpi4py and the SU2 branch, this error seems to have gone away! Thank you for your help @pcarruscag @vdweide.
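
For anyone hitting the same thing, the fix was along these lines (the install prefix is illustrative):

# build Open MPI 3.1.6 from source
wget https://download.open-mpi.org/release/open-mpi/v3.1/openmpi-3.1.6.tar.bz2
tar xf openmpi-3.1.6.tar.bz2 && cd openmpi-3.1.6
./configure --prefix=/opt/openmpi-3.1.6
make -j$(nproc) && make install

# put the new MPI first on the path, then rebuild mpi4py and SU2 against it
export PATH=/opt/openmpi-3.1.6/bin:$PATH
export LD_LIBRARY_PATH=/opt/openmpi-3.1.6/lib:$LD_LIBRARY_PATH
pip install --no-cache-dir --force-reinstall mpi4py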
