adap branch UCX error #1156

Closed
timjim333 opened this issue Jan 6, 2021 · 23 comments

@timjim333

Hi, I'm opening a new thread since it seems that this issue isn't directly related to the AMG mesh refinement itself, but feel free to close or move this to a more appropriate place, @pcarruscag.

I'm having an issue when running SU2_CFD from the feature_adap branch (which means it also fails when trying to run the mesh refinement script). It runs fine for the TestCase/euler/naca0012 case, but when I try it on my mesh I get a UCX ERROR.

When running mpirun -n 40 --use-hwthread-cpus /opt/su2/SU2v7_adap/bin/SU2_CFD test.cfg, I get variations on this message in my screen output:

|          49|   -2.095057|    0.015781|    0.001431|    0.000000|  9.1667e+04|
|          50|   -2.140503|    0.015781|    0.001431|    0.000000|  9.1667e+04|
+-----------------------------------------------------------------------+
|        File Writing Summary       |              Filename             |
+-----------------------------------------------------------------------+
|SU2 restart                        |restart_flow.dat                   |
|Paraview binary                    |flow.vtk                           |
|Paraview binary surface            |surface_flow.vtk                   |
[1609922278.175246] [super:1134625:0]           sock.c:344  UCX  ERROR recv(fd=56) failed: Bad address
[1609922278.175301] [super:1134625:0]           sock.c:344  UCX  ERROR recv(fd=54) failed: Connection reset by peer
[1609922278.175551] [super:1134625:0]           sock.c:344  UCX  ERROR sendv(fd=-1) failed: Bad file descriptor

SU2_CFD: ../externals/parmetis/libparmetis/match.c:243: libparmetis__Match_Global: Assertion `k >= firstvtx && k < lastvtx' failed.
[super:1134138] *** Process received signal ***
[super:1134138] Signal: Aborted (6)
[super:1134138] Signal code:  (-6)
[super:1134138] [ 0] /lib64/libpthread.so.0(+0x12b20)[0x7fb93d021b20]
[super:1134138] [ 1] /lib64/libc.so.6(gsignal+0x10f)[0x7fb93c1507ff]
[super:1134138] [ 2] /lib64/libc.so.6(abort+0x127)[0x7fb93c13ac35]
[super:1134138] [ 3] /lib64/libc.so.6(+0x21b09)[0x7fb93c13ab09]
[super:1134138] [ 4] /lib64/libc.so.6(+0x2fde6)[0x7fb93c148de6]
[super:1134138] [ 5] /opt/su2/SU2v7_adap/bin/SU2_CFD[0x1a9be03]
[super:1134138] [ 6] /opt/su2/SU2v7_adap/bin/SU2_CFD[0x1a94e76]
[super:1134138] [ 7] /opt/su2/SU2v7_adap/bin/SU2_CFD[0x1a9590d]
[super:1134138] [ 8] /opt/su2/SU2v7_adap/bin/SU2_CFD[0xabb1bb]
[super:1134138] [ 9] /opt/su2/SU2v7_adap/bin/SU2_CFD[0x7ddf6b]
[super:1134138] [10] /opt/su2/SU2v7_adap/bin/SU2_CFD[0x7ded07]
[super:1134138] [11] /opt/su2/SU2v7_adap/bin/SU2_CFD[0x7df356]
[super:1134138] [12] /opt/su2/SU2v7_adap/bin/SU2_CFD[0x7e445f]
[super:1134138] [13] /opt/su2/SU2v7_adap/bin/SU2_CFD[0x45ba61]
[super:1134138] [14] /lib64/libc.so.6(__libc_start_main+0xf3)[0x7fb93c13c7b3]
[super:1134138] [15] /opt/su2/SU2v7_adap/bin/SU2_CFD[0x47216e]
[super:1134138] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 38 with PID 0 on node super exited on signal 6 (Aborted).
--------------------------------------------------------------------------

Sometimes it fails with the UCX ERROR lines straight after "Building the graph adjacency structure." in Geometry Preprocessing; other times it runs fine for the first batch of iterations until it hits the first solution file writing iteration (as set by OUTPUT_WRT_FREQ), as shown in the output snippet above.

Do you have any hints on how to debug this or what might be causing this? Thanks.

To Reproduce
I've attached the mesh and config file in this link.

Desktop (please complete the following information):

  • OS: Linux CentOS 8
  • C++ compiler and version: GCC 8.22
  • MPI implementation and version: Intel (Open MPI) 3.0.0
  • SU2 Version: v7.0.3 (feature_adap branch)
@timjim333 timjim333 added the bug label Jan 6, 2021
@pcarruscag
Member

I think that can only mean the mesh is corrupted, which is causing memory errors within ParMETIS.
Memory errors can take some time to manifest, especially in small cases.
If the case is small, you can try running the serial version to see whether the problem only occurs in parallel. As for the root cause of the bad mesh, I have no idea.
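
If you want to try that, a minimal sketch assuming a meson-based build (the build directory and install prefix are illustrative):

# build a serial (no-MPI) copy of the branch and run the case without mpirun
./meson.py build_serial -Dwith-mpi=disabled --prefix=/opt/su2/SU2v7_adap_serial
./ninja -C build_serial install
/opt/su2/SU2v7_adap_serial/bin/SU2_CFD test.cfg

If the serial run is clean, that points at the parallel partitioning/communication path rather than the solver itself.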

@timjim333
Author

Hi @pcarruscag, thanks for the reply. So you think it might be a mesh issue? That may well be possible: I previously had a structured collar mesh around an unstructured core for supersonic evaluation, but I tried to diagonalize the collar mesh since it seemed that AMG refinement only works for triangles and tetrahedra... I could have made a mistake in that step! I'll take another look. Cheers.

@vdweide
Contributor

vdweide commented Jan 6, 2021

Can you run it with valgrind to check if there is a memory issue? Compile with -g. Also, does the problem persist if you reduce the number of MPI ranks?

@timjim333
Author

Hi @vdweide, can I just double-check what I should try? Compiling SU2 with -g, or using valgrind? Thanks!

@vdweide
Contributor

vdweide commented Jan 6, 2021

Compile with -g, or when using meson just add --buildtype=debug to the build arguments. Then run it as follows:

mpirun -np 40 valgrind SU2_CFD case.cfg

You'll probably get quite a few false warnings from MPI, but you can filter those out. Try to reduce the number of ranks, if possible.
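
A sketch of the full workflow, assuming a meson build (install prefix and log-file names are illustrative):

# configure and build a debug binary
./meson.py build -Dwith-mpi=enabled --buildtype=debug --prefix=/opt/su2/SU2v7_adap_debug
./ninja -C build install

# one valgrind log per MPI rank; %p expands to the process ID
mpirun -np 4 valgrind --log-file=vg.%p.log /opt/su2/SU2v7_adap_debug/bin/SU2_CFD case.cfg

# to filter the MPI false positives, add --gen-suppressions=all, paste the
# MPI-internal suppression blocks into a file (e.g. mpi.supp), and re-run
# with --suppressions=mpi.supp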

@timjim333
Author

OK, I've recompiled using --buildtype=debug and I'm running valgrind now. I'll try to run it with a reduced number of ranks and get back to you. Thanks.

@timjim333
Author

@vdweide I've attached the SU2 output and the valgrind output from running on 2 processes, i.e.: mpirun -n 2 --use-hwthread-cpus valgrind /opt/su2/SU2v7_adap/bin/SU2_CFD test.cfg
su2_out_2.txt
valgrind_out_2.txt

I also tried with 30 processes but valgrind gave up after stating that there were too many errors.

Sorry, I'm not so familiar with what to look out for. I'm guessing that anything showing up in the leak summary is a bad sign? Thanks.

@timjim333
Author

In case it helps, I also ran valgrind using --leak-check=full and --track-origins=yes. I've attached the outputs here.
valgrind_out_2_leakcheck.txt
valgrind_out_2_origins.txt

@vdweide
Contributor

vdweide commented Jan 6, 2021

No, the invalid reads and writes are problematic. They mean you are crossing the boundaries of allocated memory, and then anything can happen.
What version/branch are you using? The line numbers valgrind gives do not correspond to the current develop version.

@timjim333
Author

I'm using the 'feature_adap' branch. At least, I believe I am; I pulled the repo in this manner:

git clone https://github.com/su2code/SU2.git SU2_src
cd SU2_src
git checkout feature_adap

As far as I can tell, it's v7.0.3.
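
A quick way to confirm which branch and commit a checkout is on (standard git commands):

cd SU2_src
git rev-parse --abbrev-ref HEAD   # prints the checked-out branch, e.g. feature_adap
git log -1 --oneline              # latest commit on that branch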

@vdweide
Contributor

vdweide commented Jan 6, 2021

That's indeed how you get the feature_adap branch.
Is it possible to merge this branch with the latest version of develop first?

@timjim333
Author

I had a quick look at the merging process and it seems that quite a few files conflict (listed below; the sketch after the list shows how such a list can be reproduced). I'm not sure which files I can merge from develop without accidentally breaking the feature_adap functionality. Can I more or less pull across most of these changes? I can give it a go if you can give me some pointers, but I'm not well-versed in C++! Thanks.

Common/include/CConfig.hpp
Common/include/adt/CADTElemClass.hpp
Common/include/geometry/dual_grid/CEdge.hpp
Common/include/geometry/dual_grid/CPoint.hpp
Common/include/geometry/dual_grid/CVertex.hpp
Common/include/option_structure.hpp
Common/src/adt/CADTElemClass.cpp
Common/src/geometry/CPhysicalGeometry.cpp
Common/src/geometry/dual_grid/CPoint.cpp
SU2_CFD/include/output/COutputLegacy.hpp
SU2_CFD/include/solvers/CEulerSolver.hpp
SU2_CFD/include/solvers/CSolver.hpp
SU2_CFD/src/iteration_structure.cpp
SU2_CFD/src/numerics/flow/flow_diffusion.cpp
SU2_CFD/src/output/CFlowCompOutput.cpp
SU2_CFD/src/output/output_structure_legacy.cpp
SU2_CFD/src/solvers/CEulerSolver.cpp
SU2_CFD/src/solvers/CNSSolver.cpp
SU2_CFD/src/solvers/CSolver.cpp
SU2_CFD/src/solvers/CTurbSASolver.cpp
SU2_CFD/src/solvers/CTurbSSTSolver.cpp
SU2_CFD/src/solvers/CTurbSolver.cpp
SU2_CFD/src/variables/CEulerVariable.cpp
SU2_DOT/src/meson.build
SU2_IDE/Xcode/SU2_CFD.xcodeproj/project.pbxproj
SU2_PY/pySU2/pySU2.i
SU2_PY/pySU2/pySU2ad.i
meson_scripts/init.py
preconfigure.py
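
For reference, a conflict list like the one above can be reproduced with a dry-run merge (standard git commands; branch names as used in this thread):

# dry-run merge: list the conflicting (unmerged) files without committing anything
git fetch origin
git checkout feature_adap
git merge --no-commit --no-ff origin/develop
git diff --name-only --diff-filter=U
git merge --abort    # back out of the trial merge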

@vdweide
Contributor

vdweide commented Jan 7, 2021

No, you cannot just do that. Somebody who worked on feature_adap should have a look at it. @bmunguia, it looks like you made the latest commit to this branch, but that is already quite some time ago (May 2020). What is the current status and do you plan to merge with the latest version of develop?

@timjim333
Author

I see; I hope @bmunguia will have a chance to take a look! I tried looking through the past commits, but I didn't manage to merge all the functions successfully. From what I can tell, these are the edited variables/functions:

CConfig Class

  • WRT_SLICE
  • GetBool_Compute_Metric
  • GetWrt_Aniso_Sensor
  • GetKind_Aniso_Sensor
  • GetKind_Hessian_Method
  • GetAdap_Norm
  • GetAdap_Hmax
  • GetAdap_Hmin
  • GetAdap_ARmax
  • GetAdap_Complexity

CVertex (not sure if values should be initialised)

  • ~CVertex
  • GetnDonorPoints
  • SetDonorCoeff
  • GetDonorCoeff
  • SetInterpDonorPoint
  • GetInterpDonorPoint
  • SetInterpDonorProcessor
  • GetInterpDonorProcessor
  • Allocate_DonorInfo
  • GetVarRot
  • SetVarRot

option_structure.hpp

  • ENUM_ANISO_SENSOR
  • MapType
  • ENUM_OUTPUT
  • MakePair("INRIA", INRIA)
  • MPI_QUANTIFIES enums

CPhysicalGeometry - probably AMG stuff?

  • CPhysicalGeometry::LoadAdaptedMeshParallel_FVM
  • CPhysicalGeometry::Check_IntElem_Orientation
  • CPhysicalGeometry::Check_BoundElem_Orientation

Common/src/geometry/dual_grid/CPoint.cpp

  • Check this one!

COutputLegacy.hpp

  • Import inria amg

output_structure_legacy.cpp

  • SpecialOutput - Inria methods?

CSolver

  • CSolver::SetPositiveDefiniteHessian

meson.build

  • Add 'output/filewriter/CInriaFileWriter.cpp',

init.py - add amgio stuff

  • Add sha_version_amg
  • Add github_repo_amg

preconfigure.py

  • Add init_inria
  • Add other inria flags

@timjim333
Author

I'm unsure whether the AMG version uses its own implementation of vertices etc., or if these happen to be the way they were implemented in older versions of SU2.

@pcarruscag
Member

Most likely a mixture of those two things, but it should not be too difficult to fix.

@timjim333
Author

@pcarruscag Could you help me take a look through, or give me some pointers on where to start? I've not programmed in C++ before (maybe a good time to start, with lockdown...), but if it's a case of figuring out how to merge already-working code, I might be able to hack something together. To be honest, though, it might be better/faster for someone who actually knows what they're doing to do it!

I wanted to use this functionality as part of another project, so I'm just wary of breaking something not obvious in the background.

@pcarruscag
Member

I could, but I do not think updating that branch will fix your problem. We have not found any mesh handling bugs recently.
Creating or modifying meshes manually can get tricky (at least in my experience).
Have you tried simpler problems? Start with a problem that is known to work (there is a long issue with success stories; search for mesh adaptation here on GitHub). Then build up from it, e.g. take the same problem and use a finer grid, change the physics to what you need, then use a grid for your problem (ideally changing one thing at a time).
Also keep in mind that if that branch were finished work, it would probably have been merged into develop by now...

@timjim333
Author

@pcarruscag I see - sorry, I didn't realise it might still be a mesh problem; I thought it was a memory issue from the error messages! OK, I'll give it another try from scratch. If I understand correctly, AMG only works with triangles and tetrahedra, not pyramids or quads - is that right? Thanks again.

@timjim333
Author

Hi @pcarruscag, I just tried a simpler mesh, and with MPI I still get the UCX crash.
err_log_SU2v7.0.3.txt

To double-check, I also used the master v7.0.8 SU2_CFD. When I run with MPI I get the UCX error, but when I run in serial the solution appears to converge fine. I suspect this means it's probably not the mesh causing the issues - what are your thoughts?
su2_out_serial.txt
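
For clarity, the two runs were along these lines (the v7.0.8 binary path is illustrative):

# parallel run: fails with UCX errors
mpirun -n 40 --use-hwthread-cpus /opt/su2/SU2v7.0.8/bin/SU2_CFD test.cfg

# serial run of the same case: converges fine (same binary, no mpirun)
/opt/su2/SU2v7.0.8/bin/SU2_CFD test.cfg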

@pcarruscag
Member

I searched for "UCX error" and found, e.g., this: openucx/ucx#4742
I don't know for sure, but it looks like an MPI configuration problem...
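
One way to test that hypothesis (my suggestion, not something verified for this setup) is to tell Open MPI to bypass UCX and use its plain TCP/shared-memory transports via MCA parameters:

# force the ob1 PML with tcp/self/vader BTLs instead of UCX
mpirun -n 40 --mca pml ob1 --mca btl tcp,self,vader /opt/su2/SU2v7_adap/bin/SU2_CFD test.cfg

If the crash disappears with UCX out of the picture, the UCX/MPI installation is the likely culprit.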

@timjim333
Author

Interesting - my MPI is straight from the CentOS repo, so I didn't expect it to be the issue, but I'll try to compile another version just to check.

@timjim333
Author

After pulling in the latest Open MPI v3 (3.1.6) and recompiling both mpi4py and the SU2 branch, this error seems to have gone away! Thank you for your help @pcarruscag @vdweide.
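
For anyone hitting the same thing, the fix was along these lines (the install prefix is illustrative):

# build Open MPI 3.1.6 from source
wget https://download.open-mpi.org/release/open-mpi/v3.1/openmpi-3.1.6.tar.bz2
tar xf openmpi-3.1.6.tar.bz2 && cd openmpi-3.1.6
./configure --prefix=/opt/openmpi-3.1.6
make -j$(nproc) && make install

# put the new MPI first on the path, then rebuild mpi4py and SU2 against it
export PATH=/opt/openmpi-3.1.6/bin:$PATH
export LD_LIBRARY_PATH=/opt/openmpi-3.1.6/lib:$LD_LIBRARY_PATH
pip install --no-cache-dir --force-reinstall mpi4py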
