Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

This is alarming...Engine and mdserver linked with DB plugin libs! #19565

Open
markcmiller86 opened this issue May 25, 2024 · 27 comments
Open

This is alarming...Engine and mdserver linked with DB plugin libs! #19565

markcmiller86 opened this issue May 25, 2024 · 27 comments
Assignees
Labels
bug Something isn't working dependencies Pull requests that update a dependency file impact high Productivity significantly degraded (non-mitigable bug) or improved (enhancement) priority a priority ticket reviewed Issue has been reviewed and labeled by a developer
Milestone

Comments

@markcmiller86
Copy link
Member

markcmiller86 commented May 25, 2024

Go to build dir for engine and do a make clean; make VERBOSE=1 >& junk.out and then grep junk.out for hdf5. You will get hits. But, you should NOT get hits for hdf5. hdf5 is used only in a database plugin lib. If I look at a link of the libengine_ser.dylib, I get all the items listed below.

We are also getting hits for conduit, condiut_relay and blueprint. See below. Again, those are only used in DB plugins and should not be being linked into the engine.

I believe the reason this is happening is the use of MFEM in the engine. That is fine. But, we cannot use an MFEM in the engine that depends on I/O libs needed only in the plugins. We need to build MFEM differently for use in the engine.

Taggiging @iulian787 and @vijaysm because this is impacting MOAB plugin which uses HDF5 in either serial or parallel and the fact that engine is loading hdf5 serial prevents parallel MOAB plugin from operating correctly.

/usr/bin/clang++
-fno-common
-fexceptions
-g
-isysroot
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX12.3.sdk
-dynamiclib
-Wl,-headerpad_max_install_names
-o
../../lib/libengine_ser.dylib
-install_name
@rpath/libengine_ser.dylib
CMakeFiles/engine_ser.dir/ProgrammableCompositer.C.o
CMakeFiles/engine_ser.dir/DataNetwork.C.o
CMakeFiles/engine_ser.dir/ClonedDataNetwork.C.o
CMakeFiles/engine_ser.dir/CumulativeQueryNamedSelectionExtension.C.o
CMakeFiles/engine_ser.dir/Engine.C.o
CMakeFiles/engine_ser.dir/EngineBase.C.o
CMakeFiles/engine_ser.dir/LoadBalancer.C.o
CMakeFiles/engine_ser.dir/MesaDisplay.C.o
CMakeFiles/engine_ser.dir/Netnodes.C.o
CMakeFiles/engine_ser.dir/NetworkManager.C.o
CMakeFiles/engine_ser.dir/VisItDisplay.C.o
-L/Users/miller86/visit/visit/34rc/release/build-mb-3.4.1-darwin-21-x86_64-release/thirdparty_shared/third_party/conduit/v0.9.1/darwin-x86_64/lib
-L/Users/miller86/visit/visit/34rc/release/build-mb-3.4.1-darwin-21-x86_64-release/thirdparty_shared/third_party/hdf5/1.8.14/darwin-x86_64/lib
-L/Users/miller86/visit/visit/34rc/release/build-mb-3.4.1-darwin-21-x86_64-release/thirdparty_shared/third_party/zlib/1.2.13/darwin-x86_64/lib
-L/Users/miller86/visit/visit/34rc/release/build-mb-3.4.1-darwin-21-x86_64-release/thirdparty_shared/third_party/mfem/4.6/darwin-x86_64/lib
-Wl,-rpath,/Users/miller86/visit/visit/34rc/release/build-mb-3.4.1-darwin-21-x86_64-release/thirdparty_shared/third_party/conduit/v0.9.1/darwin-x86_64/lib
-Wl,-rpath,/Users/miller86/visit/visit/34rc/release/build-mb-3.4.1-darwin-21-x86_64-release/thirdparty_shared/third_party/hdf5/1.8.14/darwin-x86_64/lib
-Wl,-rpath,/Users/miller86/visit/visit/34rc/release/build-mb-3.4.1-darwin-21-x86_64-release/thirdparty_shared/third_party/zlib/1.2.13/darwin-x86_64/lib
-Wl,-rpath,/Users/miller86/visit/visit/34rc/release/build-mb-3.4.1-darwin-21-x86_64-release/thirdparty_shared/third_party/mfem/4.6/darwin-x86_64/lib
-Wl,-rpath,/Users/miller86/visit/visit/34rc/build/lib
../../lib/libenginerpc.dylib
../../lib/libavtwriter_ser.dylib
../../lib/libavtquery_ser.dylib
../../lib/libavtviswindow_ser.dylib
-lmfem
-lhdf5
-lz
-lhdf5
-lz
-lz
-lz
../../lib/libavtblueprint.dylib
/Users/miller86/visit/visit/34rc/release/build-mb-3.4.1-darwin-21-x86_64-release/thirdparty_shared/third_party/vtk/9.2.6/darwin-x86_64/lib/libvtkIOPLY.9.2.6.dylib
../../lib/libavtshapelets.dylib
../../lib/libavtexpressions_ser.dylib
../../lib/libavtdbin_ser.dylib
../../lib/libavtpythonfilters_ser.dylib
/Users/miller86/visit/visit/34rc/release/build-mb-3.4.1-darwin-21-x86_64-release/thirdparty_shared/third_party/python/3.9.18/darwin-x86_64/lib/libpython3.9.dylib
../../lib/libavtplotter_ser.dylib
../../lib/libavtfilters_ser.dylib
../../lib/libavtview.dylib
/Users/miller86/visit/visit/34rc/release/build-mb-3.4.1-darwin-21-x86_64-release/thirdparty_shared/third_party/vtk/9.2.6/darwin-x86_64/lib/libvtkRenderingFreeType.9.2.6.dylib
/Users/miller86/visit/visit/34rc/release/build-mb-3.4.1-darwin-21-x86_64-release/thirdparty_shared/third_party/vtk/9.2.6/darwin-x86_64/lib/libvtkfreetype.9.2.6.dylib
../../lib/libavtdatabase_ser.dylib
../../lib/libavtmir_ser.dylib
../../lib/libavtpipeline_ser.dylib
../../lib/libavtdbatts.dylib
../../lib/libvisit_verdict.dylib
../../lib/libtess2.dylib
/Users/miller86/visit/visit/34rc/release/build-mb-3.4.1-darwin-21-x86_64-release/thirdparty_shared/third_party/vtk/9.2.6/darwin-x86_64/lib/libvtkRenderingRayTracing.9.2.6.dylib
/Users/miller86/visit/visit/34rc/release/build-mb-3.4.1-darwin-21-x86_64-release/thirdparty_shared/third_party/vtk/9.2.6/darwin-x86_64/lib/libvtkDomainsChemistry.9.2.6.dylib
/Users/miller86/visit/visit/34rc/release/build-mb-3.4.1-darwin-21-x86_64-release/thirdparty_shared/third_party/vtk/9.2.6/darwin-x86_64/lib/libvtkRenderingSceneGraph.9.2.6.dylib
/Users/miller86/visit/visit/34rc/release/build-mb-3.4.1-darwin-21-x86_64-release/thirdparty_shared/third_party/vtk/9.2.6/darwin-x86_64/lib/libvtkRenderingOpenGL2.9.2.6.dylib
/Users/miller86/visit/visit/34rc/release/build-mb-3.4.1-darwin-21-x86_64-release/thirdparty_shared/third_party/vtk/9.2.6/darwin-x86_64/lib/libvtkRenderingHyperTreeGrid.9.2.6.dylib
/Users/miller86/visit/visit/34rc/release/build-mb-3.4.1-darwin-21-x86_64-release/thirdparty_shared/third_party/vtk/9.2.6/darwin-x86_64/lib/libvtkFiltersHybrid.9.2.6.dylib
/Users/miller86/visit/visit/34rc/release/build-mb-3.4.1-darwin-21-x86_64-release/thirdparty_shared/third_party/vtk/9.2.6/darwin-x86_64/lib/libvtkRenderingUI.9.2.6.dylib
-framework
Cocoa
/Users/miller86/visit/visit/34rc/release/build-mb-3.4.1-darwin-21-x86_64-release/thirdparty_shared/third_party/vtk/9.2.6/darwin-x86_64/lib/libvtkglew.9.2.6.dylib
/Users/miller86/visit/visit/34rc/release/build-mb-3.4.1-darwin-21-x86_64-release/thirdparty_shared/third_party/vtk/9.2.6/darwin-x86_64/lib/libvtkRenderingVolume.9.2.6.dylib
-Xlinker
-framework
-Xlinker
OpenGL
/Users/miller86/visit/visit/34rc/release/build-mb-3.4.1-darwin-21-x86_64-release/thirdparty_shared/third_party/vtk/9.2.6/darwin-x86_64/lib/libvtkInteractionStyle.9.2.6.dylib
/Users/miller86/visit/visit/34rc/release/build-mb-3.4.1-darwin-21-x86_64-release/thirdparty_shared/third_party/vtk/9.2.6/darwin-x86_64/lib/libvtkRenderingCore.9.2.6.dylib
../../lib/libvisit_vtk.dylib
/Users/miller86/visit/visit/34rc/release/build-mb-3.4.1-darwin-21-x86_64-release/thirdparty_shared/third_party/vtk/9.2.6/darwin-x86_64/lib/libvtkFiltersExtraction.9.2.6.dylib
/Users/miller86/visit/visit/34rc/release/build-mb-3.4.1-darwin-21-x86_64-release/thirdparty_shared/third_party/vtk/9.2.6/darwin-x86_64/lib/libvtkIOXMLParser.9.2.6.dylib
/Users/miller86/visit/visit/34rc/release/build-mb-3.4.1-darwin-21-x86_64-release/thirdparty_shared/third_party/vtk/9.2.6/darwin-x86_64/lib/libvtkFiltersFlowPaths.9.2.6.dylib
/Users/miller86/visit/visit/34rc/release/build-mb-3.4.1-darwin-21-x86_64-release/thirdparty_shared/third_party/vtk/9.2.6/darwin-x86_64/lib/libvtkFiltersGeometry.9.2.6.dylib
/Users/miller86/visit/visit/34rc/release/build-mb-3.4.1-darwin-21-x86_64-release/thirdparty_shared/third_party/vtk/9.2.6/darwin-x86_64/lib/libvtkImagingHybrid.9.2.6.dylib
/Users/miller86/visit/visit/34rc/release/build-mb-3.4.1-darwin-21-x86_64-release/thirdparty_shared/third_party/vtk/9.2.6/darwin-x86_64/lib/libvtkIOImage.9.2.6.dylib
/Users/miller86/visit/visit/34rc/release/build-mb-3.4.1-darwin-21-x86_64-release/thirdparty_shared/third_party/zlib/1.2.13/darwin-x86_64/lib/libz.dylib
/Users/miller86/visit/visit/34rc/release/build-mb-3.4.1-darwin-21-x86_64-release/thirdparty_shared/third_party/vtk/9.2.6/darwin-x86_64/lib/libvtktiff.9.2.6.dylib
/Users/miller86/visit/visit/34rc/release/build-mb-3.4.1-darwin-21-x86_64-release/thirdparty_shared/third_party/vtk/9.2.6/darwin-x86_64/lib/libvtkImagingCore.9.2.6.dylib
../../lib/libavtmath.dylib
/Users/miller86/visit/visit/34rc/release/build-mb-3.4.1-darwin-21-x86_64-release/thirdparty_shared/third_party/vtk/9.2.6/darwin-x86_64/lib/libvtkFiltersModeling.9.2.6.dylib
/Users/miller86/visit/visit/34rc/release/build-mb-3.4.1-darwin-21-x86_64-release/thirdparty_shared/third_party/vtk/9.2.6/darwin-x86_64/lib/libvtkFiltersSources.9.2.6.dylib
/Users/miller86/visit/visit/34rc/release/build-mb-3.4.1-darwin-21-x86_64-release/thirdparty_shared/third_party/vtk/9.2.6/darwin-x86_64/lib/libvtkFiltersGeneral.9.2.6.dylib
/Users/miller86/visit/visit/34rc/release/build-mb-3.4.1-darwin-21-x86_64-release/thirdparty_shared/third_party/vtk/9.2.6/darwin-x86_64/lib/libvtkCommonComputationalGeometry.9.2.6.dylib
-lmfem
../../lib/liblightweight_visit_vtk.dylib
../../lib/libvisitcommon.dylib
/Users/miller86/visit/visit/34rc/release/build-mb-3.4.1-darwin-21-x86_64-release/thirdparty_shared/third_party/zlib/1.2.13/darwin-x86_64/lib/libz.dylib
-ldl
/Users/miller86/visit/visit/34rc/release/build-mb-3.4.1-darwin-21-x86_64-release/thirdparty_shared/third_party/vtk/9.2.6/darwin-x86_64/lib/libvtkIOLegacy.9.2.6.dylib
/Users/miller86/visit/visit/34rc/release/build-mb-3.4.1-darwin-21-x86_64-release/thirdparty_shared/third_party/vtk/9.2.6/darwin-x86_64/lib/libvtkIOCore.9.2.6.dylib
-lconduit
-lconduit_relay
-lconduit_blueprint
-lhdf5
-lz
-lhdf5
-lz
/Users/miller86/visit/visit/34rc/release/build-mb-3.4.1-darwin-21-x86_64-release/thirdparty_shared/third_party/vtk/9.2.6/darwin-x86_64/lib/libvtkFiltersCore.9.2.6.dylib
/Users/miller86/visit/visit/34rc/release/build-mb-3.4.1-darwin-21-x86_64-release/thirdparty_shared/third_party/vtk/9.2.6/darwin-x86_64/lib/libvtkCommonExecutionModel.9.2.6.dylib
/Users/miller86/visit/visit/34rc/release/build-mb-3.4.1-darwin-21-x86_64-release/thirdparty_shared/third_party/vtk/9.2.6/darwin-x86_64/lib/libvtkCommonDataModel.9.2.6.dylib
/Users/miller86/visit/visit/34rc/release/build-mb-3.4.1-darwin-21-x86_64-release/thirdparty_shared/third_party/vtk/9.2.6/darwin-x86_64/lib/libvtkCommonTransforms.9.2.6.dylib
/Users/miller86/visit/visit/34rc/release/build-mb-3.4.1-darwin-21-x86_64-release/thirdparty_shared/third_party/vtk/9.2.6/darwin-x86_64/lib/libvtkCommonMisc.9.2.6.dylib
/Users/miller86/visit/visit/34rc/release/build-mb-3.4.1-darwin-21-x86_64-release/thirdparty_shared/third_party/vtk/9.2.6/darwin-x86_64/lib/libvtkCommonMath.9.2.6.dylib
/Users/miller86/visit/visit/34rc/release/build-mb-3.4.1-darwin-21-x86_64-release/thirdparty_shared/third_party/vtk/9.2.6/darwin-x86_64/lib/libvtkkissfft.9.2.6.dylib
/Users/miller86/visit/visit/34rc/release/build-mb-3.4.1-darwin-21-x86_64-release/thirdparty_shared/third_party/vtk/9.2.6/darwin-x86_64/lib/libvtkCommonCore.9.2.6.dylib
/Users/miller86/visit/visit/34rc/release/build-mb-3.4.1-darwin-21-x86_64-release/thirdparty_shared/third_party/vtk/9.2.6/darwin-x86_64/lib/libvtksys.9.2.6.dylib
@markcmiller86 markcmiller86 added bug Something isn't working impact high Productivity significantly degraded (non-mitigable bug) or improved (enhancement) priority a priority ticket dependencies Pull requests that update a dependency file labels May 25, 2024
@markcmiller86
Copy link
Member Author

If I disable mfem, I get the same problem with conduit. The engine includes and links with conduit so any dependencies for conduit get linked into the engine.

@markcmiller86
Copy link
Member Author

@iulian787 and @vijaysm if I disable conduit, mfem and fsm in my build, I can get the engine to run and use MOAB in parallel correctly...

Screen Shot 2024-05-26 at 8 01 15 PM

@vijaysm
Copy link

vijaysm commented May 27, 2024

That is excellent news! Thanks for checking this thoroughly @markcmiller86

When you say MOAB parallel engine is working correctly, I assume this means there are no HDF5 property table errors that we were seeing before? I am still puzzled as to why the Ubuntu 22 version of engine_par is still failing (eventhough ldd verification showed that it was not linked against HDF5-serial).

@markcmiller86
Copy link
Member Author

When you say MOAB parallel engine is working correctly, I assume this means there are no HDF5 property table errors that we were seeing before?

Yes, correct.

I am still puzzled as to why the Ubuntu 22 version of engine_par is still failing (eventhough ldd verification showed that it was not linked against HDF5-serial).

I am too but I believe linux is more lenient about shared lib dependencies and so think the engine is still loading the serial hdf5 library. An strace would confirm that.

@markcmiller86
Copy link
Member Author

Was chatting with @brugger1 about this. One idea he has...build engine_par against hdf5_mpi instead of hdf5. That would fix the collision with MOAB parallel database plugin.

But, it would mean we continue to link the engine (and mdsever) against hdf5 and that would mean nobody would be allowed to build a custom plugin using a different version of hdf5 than what VisIt was built with.

If VisIt was not on such an ancient version of hdf5, that is probably ok. But, because we are on such an ancient version of hdf5, it would likely conflict with any plugin developer using many of the newer features in newer versions of hdf5.

So, we would need to upgrade hdf5 in VisIt.

That said, I still think it would be best to get hdf5 out of the engine and mdserver if possible.

@biagas
Copy link
Contributor

biagas commented May 28, 2024

@markcmiller86 conduit and mfem were added as dependencies when avt/Blueprint and avt/MFEM were added.
With those additions, we also get whatever dependencies conduit and mfem have.

@markcmiller86
Copy link
Member Author

@markcmiller86 conduit and mfem were added as dependencies when avt/Blueprint and avt/MFEM were added.
With those additions, we also get whatever dependencies conduit and mfem have.

Yes, that is right. So, it may mean we have to build those libs two different ways...one way for VisIt components and another way for database plugins...only the latter of which can depend on things like hdf5, netcdf, etc.

@cyrush
Copy link
Member

cyrush commented May 28, 2024

@markcmiller86 Both Conduit and MFEM are used outside of the database plugins.
We have avt libs that provide MFEM to VTK and Conduit to VTK as a general service for the engine.

While Conduit's I/O that uses HDF5 does not need to link HDF5, I think that MFEM actually links Conduit relay, which does link HDF5.

In general unless we have a fully name mangled serial hdf5 and mpi hdf5, I think we have an issue regardless if it's DB only vs in the engine. Yes we can disable plugins, but that approach won't work for an install that can be widely used

@markcmiller86
Copy link
Member Author

Both Conduit and MFEM are used outside of the database plugins.

Sure...but I guess my question is...what do conduit and MFEM need to do in the way of I/O with HDF5 in the engine and mdserver? I don't think the engine or mdserver need to do any I/O with or without HDF5 and so the question remains...why do business this way? It can't be for the convenience of a 3rd party lib dependency that isn't designed to build without HDF5?

As an aside, as things are designed now, no one building a custom HDF5 plugin would be able to build against an hdf5 version other than 1.8.XX (1.8.14 is over 10 years old now). So, someone wishing to use newer features in HDF5 would simply not be able to use a newer HDF5. That paricular issue is solved, of course, by updating to newer HDF5 in VisIt. But, that only minimizes that particular issue. It doesn't fix it.

@cyrush
Copy link
Member

cyrush commented May 28, 2024

The answer is simple: MFEM does not have multiple libraries that partition features based on dependencies.

Those features are either on or off for a build of mfem.

Conduit has multiple libraries - (relay is the one with all the i/o deps), but if MFEM is using Conduit, it will also link those I/O libs.

@cyrush
Copy link
Member

cyrush commented May 28, 2024

We are going to explore building MPI enabled HDF5 will work for all cases (engine_ser and engine_par)

@cyrush cyrush added the reviewed Issue has been reviewed and labeled by a developer label May 28, 2024
@markcmiller86
Copy link
Member Author

markcmiller86 commented May 28, 2024

Ok, I tried a simple test with my build of VisIt 3.4.1. on macOS where I have disabled MFEM and conduit. But, I do have things like Silo (which is using HDF5 in serial) and MOAB (using HDF5 in serial in a serial engine and in parallel in a parallel engine).

  1. Start VisIt ./bin/visit -np 4 (confirm I've got a parallel engine)
  2. Open Silo's multi_ucd3d.silo in the silo_hdf5_test_data directory (confirm it is indeed Silo/HDF5 data...it is)
  3. Put up a mesh and PC plot...works fine!
  4. Delete the plots and close the database...or not...it doesn't change the outcome
  5. Open a MOAB database (which will trigger the MOAB parallel plugin)
  6. Put up a PC plot of ELEM_GLOBAL_ID...it plots fine...in particular, no hdf5 api trace errors on stdout/stderr...which we have seen when its wound up confusing parallel HDF5 with serial HDF5

So, this works. And, that is because we DO NOT load plugin shared libraries using RTLD_GLOBAL which would make the symbols from the loaded library visible in the global namespace of the calling executable. We DO NOT specify RTLD_LOCAL either but it turns out if neither is specified, RTLD_LOCAL is the default behavior (on both macOS and Linux). See related ChatGPT discussion about this.

Below, I use macOS lsof to report which libraries are actually loaded into the running engine_par process using one of the PIDs...You can see it has BOTH serial and parallel HDF5 libraries loaded.

ps -ef | grep engine_par
 3640 89593 89592   0  4:24PM ??         0:02.03 /Users/miller86/visit/visit/34rc/build/exe/engine_par -plugindir /Users/miller86/.visit/3.4.1/darwin-x86_64/plugins:/Users/miller86/visit/visit/34rc/build/plugins -visithome /Users/miller86/visit/visit/34rc/build -visitarchhome /Users/miller86/visit/visit/34rc/build/ -dv -host 127.0.0.1 -port 5600 -key 1859f91da393c08335dc
 3640 89594 89592   0  4:24PM ??         0:11.80 /Users/miller86/visit/visit/34rc/build/exe/engine_par -plugindir /Users/miller86/.visit/3.4.1/darwin-x86_64/plugins:/Users/miller86/visit/visit/34rc/build/plugins -visithome /Users/miller86/visit/visit/34rc/build -visitarchhome /Users/miller86/visit/visit/34rc/build/ -dv -host 127.0.0.1 -port 5600 -key 1859f91da393c08335dc
 3640 89595 89592   0  4:24PM ??         0:11.84 /Users/miller86/visit/visit/34rc/build/exe/engine_par -plugindir /Users/miller86/.visit/3.4.1/darwin-x86_64/plugins:/Users/miller86/visit/visit/34rc/build/plugins -visithome /Users/miller86/visit/visit/34rc/build -visitarchhome /Users/miller86/visit/visit/34rc/build/ -dv -host 127.0.0.1 -port 5600 -key 1859f91da393c08335dc
 3640 89596 89592   0  4:24PM ??         0:11.82 /Users/miller86/visit/visit/34rc/build/exe/engine_par -plugindir /Users/miller86/.visit/3.4.1/darwin-x86_64/plugins:/Users/miller86/visit/visit/34rc/build/plugins -visithome /Users/miller86/visit/visit/34rc/build -visitarchhome /Users/miller86/visit/visit/34rc/build/ -dv -host 127.0.0.1 -port 5600 -key 1859f91da393c08335dc
[scratlantis:5.5.0/darwin-x86_64/lib] miller86% lsof -p 89593 | grep hdf
engine_pa 89593 miller86  txt    REG                1,4  3910464          1282989090 /Users/miller86/visit/visit/34rc/release/build-mb-3.4.1-darwin-21-x86_64-release/thirdparty_shared/third_party/hdf5/1.8.14/darwin-x86_64/lib/libhdf5.9.dylib
engine_pa 89593 miller86  txt    REG                1,4  5073584          1300738963 /Users/miller86/visit/visit/34rc/release/build-mb-3.4.1-darwin-21-x86_64-release/thirdparty_shared/third_party/moab_mpi/5.5.0-hdf5-1.14.3/darwin-x86_64/lib/libMOAB_mpi.5.dylib
engine_pa 89593 miller86  txt    REG                1,4  9282624          1300717741 /Users/miller86/visit/visit/34rc/release/build-mb-3.4.1-darwin-21-x86_64-release/thirdparty_shared/third_party/hdf5_mpi/1.14.3/darwin-x86_64/lib/libhdf5_mpi.310.dylib

@markcmiller86
Copy link
Member Author

@iulian787 and @vijaysm one option we're considering here is to do away with serial/parallel builds of HDF5. We would build only parallel HDF5 and everything in VisIt that depended on HDF5 would be linked to that one, single parallel HDF5. The "serial" tools would be have to be linked with -lmpi for example, but would, in theory anyway, never reference the MPI symbols in them.

What would you think of this?

@markcmiller86
Copy link
Member Author

@cyrush if we can do that with HDF5, why can't we do that with all of VisIt and get away from building all of VisIt with _par and _ser variants of everything. We just build everything parallel and agree we never reference the mpi symbols when running in serial?

@vijaysm
Copy link

vijaysm commented May 29, 2024

The "serial" tools would be have to be linked with -lmpi for example, but would, in theory anyway, never reference the MPI symbols in them.

What would you think of this?

Excellent! This was my suggestion long time back. I do this all the time in my workflows using MPI wrappers to build every library in my systems, whether it is serial code or MPI aware one.

Then we would just build MOAB+HDF5 without worrying about serial builds, with a guarantee that only MPI aware HDF5 will ever be loaded by Visit. Would also simplify builds in general and reduce distribution size :-)

@cyrush
Copy link
Member

cyrush commented May 29, 2024

@cyrush if we can do that with HDF5, why can't we do that with all of VisIt and get away from building all of VisIt with _par and _ser variants of everything. We just build everything parallel and agree we never reference the mpi symbols when running in serial?

Hi Mark,
In VisIt itself - we use a single source to produce both serial and MPI libs. Things aren't partitioned.

This is convenient for many filters that share logic and then add extra communication for the MPI case.

MPI support is controlled by compiler defines, which yield the serial and parallel libs.
It would require refactoring and a runtime (instead of compile time) switch to be added.

It's possible to do, but would be a major change.

@markcmiller86
Copy link
Member Author

MPI support is controlled by compiler defines, which yield the serial and parallel libs.

@cyrush...right...we simply adjust those #if PARALLEL code blocks to all be run-time conditionals and then we build only MPI-enabled object files and libs. This would probably halve our compile time, halve our distribution sizes and just generally simplify a lot of things in our CMake logic, build logic, plugin logic, etc.

@markcmiller86
Copy link
Member Author

markcmiller86 commented Jun 1, 2024

Summarizing...

Ok, so inputs from @cyrush, @qkoziol, and @vijaysm all suggest the right way to proceed is to do away with building dependencies in different ways (e.g. with and without MPI) and just know that running a serial VisIt will never reference any MPI enabled code blocks in VisIt itself or any dependencies. It means a serial VisIt engine is still linked with -lmpi for example to satisfy the linker.

Above, @cyrush mentioned another issue I hadn't really appreciated before digging into this in detail. In VisIt, we have a lot of code blocks of the form...

#if PARALLEL 
    // do something for parallel with MPI_Xxx() calls
#else 
   // do it the serial way
#endif

This really does mean you have to compile two different ways to get two different behaviors. To do the same thing in VisIt proper with various libavtThisOrThat.so as we are aiming with TPL libs, we would need all those to blocks to be chosen at run time, not compile time.

We're kinda forced into this situation because libraries we want to use in VisIt proper such as MFEM have an indirect dependency on HDF5. So, the engine is going to wind up getting linked with an HDF5 regardless.

But, because VisIt's HDF5 is ancient (1.8.14), we really must upgrade that asap to latest HDF5 on develop. We are ok leaving it a 1.8.14 on the RC.

Here is the work to complete for this ticket then...

  • upgrade HDF5 to 1.14.4 on develop
  • Modify build_visit to build HDF5 and MOAB only once (with MPI if build_visit is building a parallel-enabled VisIt and without MPI if build_visit is building only a serial VisIt)
  • Upgrade MOAB library being used in VisIt (on RC and develop) to 5.5.1
  • Adjust any other build_visit TPLs that have followed this paradigm (e.g. ADIOS2)
  • Build 3.4 series of VisIt for ANL ubuntu, gce, polaris and crysalis WITHOUT mfem and conduit (this will remove the internal HDF5 dependency and allow MOAB/HDF5 to work as is there without loosing any functionality essential to ANL teams)

@markcmiller86 markcmiller86 added this to the 3.4.2 milestone Jun 1, 2024
@qkoziol
Copy link

qkoziol commented Jun 1, 2024

I suggest mov8ng all the way up to HDF5 1.14.4 🙂

@cyrush cyrush modified the milestones: 3.4.2, 3.5.0 Jul 18, 2024
@markcmiller86
Copy link
Member Author

@cyrush, @qkoziol, and @vijaysm, one issue I am encountering is all the libraries VisIt builds which depend on HDF5 (conduit, mfem, cgns, netcdf, xdmf, silo, h5part). They themselves may not use MPI. But, when I compile them against a parallel-enabled HDF5, I get failures for #include <mpi.h> from #include <hdf5.h>. Maybe this means that now, those libs will require -I<path-to-mpi-headers> when then are being compiled AND using hdf5. I guess they will also need a -L<path-to-mpi-library> -lmpi too anyways in order to link a dynamic library.

@cyrush ... in particular, will this new approach defeat our ability to ever build VisIt static-only?

@vijaysm
Copy link

vijaysm commented Aug 7, 2024 via email

@qkoziol
Copy link

qkoziol commented Aug 7, 2024

@vijaysm 's suggestion will address the immediate concern, but I think we can also do better than that.

@markcmiller86 @cyrush - When building a serial application (or library) against a parallel-enabled build of HDF5, would a compile-time macro (i.e. a "-DFOO" flag to the compiler) that disabled the include of the MPI header within the hdf5.h header be more helpful? I could wrap the #include <mpi.h> in the hdf5.h header with a #ifndef FOO block, so that the MPI header would be skipped for that build. You would then also need to link in a static library that had NOP stubs for all the MPI functions that the HDF5 library uses internally (which it won't call, since you are running a serial application/library), to get past the linking problem.

Thoughts? If that sounds link it would be a solution (it should be easy to hack the hdf5.h header and create such a library of NOP MPI function stubs, if you'd like to prototype it first), I can create a PR for the change to the hdf5.h header and an update on the docs that describes why it's there and how to take advantage of it. And I can create a new project in the HPC-IO github org that had the necessary NOP MPI function stubs. I imagine that the HDF Group might adopt a fork of my NOP MPI library.

(@derobins @sbyna - for visibility)

@markcmiller86
Copy link
Member Author

@vijaysm and @qkoziol thanks for your inputs on this 💪.

@qkoziol I like some of the ideas your are floating here. But, MPI stub functions scare me just a tad due to the possibility of confusion/collision with the real mpi. I was burned by this kind of thing years ago in another project and never forgot it...maybe because of the number of days I spent failing to understand what was happening.

How realistice is it for HDF5 library to be structured such that a single installation of HDF5 that supports both serial and parallel involves linking to one lib for serial applications and two libs for parallel applications? Likewise, maybe there is one header for serial applications and two header files to be included for parallel applications? Your -DFOO approach effectively gives a client the ability to treat hdf5.h in two different ways so maybe the header file issue is solved with that. I mean, it works for me.

But, the library linking issue is still a concern and while I see why HDF5 developers might wanna go the MPI stub route, I just wanna ask about the refactor MPI stuff to a separate lib route.

@cyrush
Copy link
Member

cyrush commented Aug 7, 2024

@qkoziol I think for HDF5 at least, we haven't had an issue where we were required use MPI includes due to HDF5 headers.
Conduit does have its own compile time defs that will only be enabled when Conduit's MPI support is on.

@markcmiller86 I think conduit should be ok, can you share if my assumption is wrong?

It does look like h5public.h will include <mpi.h> when hdf5 was compiled with MPI support.
I don't know what the include chain is, but is h5public.h pulled in by most of the interface?
(If so I am puzzled why this hasn't been an issue in conduit for me)

Usually, the problem is a linking issue -- not a compile header interface issue.

@markcmiller86
Copy link
Member Author

@cyrush I can answer some of the details of what is happening with build_visit.

Most of the libs that depend on only sereial HDF5 are compiled using build_visit's C_COMPILER and CXX_COMPILER vars. Those are typically set to the underlying compilers used in the mpi compiler wrappers.

Only those libs with parallel variants (e.g. conduit, hdf5, adios) are compiled with build_visit's PAR_COMPILER and PAR_COMPILER_CXX which typically resolve to mpi wrapper compilers.

So, when were using a parallel-enabled HDF5, we build Silo with gcc, not mpicc and the #include <mpi.h> fails without explicitly adding -I<path-to-mpi-headers to Silo's build.

@markcmiller86
Copy link
Member Author

I would much prefer to live in a world where I am not required to specify a path to a header file (or for that matter a library file) my package does not itself use 🤣.

@cyrush another detail here that may be relevant, I think a lot of the packages in play here use Autotools, not CMake. As part of the update to HDF5-1.14.4, I have adjusted bv_hdf5.sh to use CMake.

@cyrush
Copy link
Member

cyrush commented Aug 8, 2024

thanks @markcmiller86 , I think I have tried this case outside of build_visit (w/ spack). I can try a simpler case and see if there is an issue. I just don't recall getting tripped up.

You can't avoid the link dependency, but with the right setup (defines and generated headers) I am confident that libraries can avoid the implicit header problem, that is to avoid it being an issue in downstream consumers.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working dependencies Pull requests that update a dependency file impact high Productivity significantly degraded (non-mitigable bug) or improved (enhancement) priority a priority ticket reviewed Issue has been reviewed and labeled by a developer
Projects
None yet
Development

No branches or pull requests

5 participants