ATDM Trilinos 'sems-rhel7' configuration broken on new CEE hpwsXYZ and cee-buildXYZ machines #10022

Closed
bartlettroscoe opened this issue Dec 14, 2021 · 13 comments
Labels
CLOSED_DUE_TO_INACTIVITY Issue or PR has been closed by the GitHub Actions bot due to inactivity. Framework tasks Framework tasks (used internally by Framework team) impacting: tests The defect (bug) is primarily a test failure (vs. a build failure) MARKED_FOR_CLOSURE Issue or PR is marked for auto-closure by the GitHub Actions bot. PA: Framework Issues that fall under the Trilinos Framework Product Area type: bug The primary issue is a bug in Trilinos code or tests

Comments

@bartlettroscoe
Member

bartlettroscoe commented Dec 14, 2021

@trilinos/framework

Description

It seems the SEMS modules are broken on the new CEE 'hpwsXYZ' and 'cee-buildXYZ' machines, and the Trilinos GenConfig scripts don't detect that they are broken (see below).

Original Description

It appears that the SEMS RHEL7 modules or perhaps just the ATDM Trilinos 'sems-rhel7' configuration is broken on the new CEE HPWS machines. It seems the builds complete but a lot of tests fail with errors like:

symbol lookup error: /projects/sems/install/rhel7-x86_64/sems/compiler/intel/18.0.5/openmpi/1.10.1/lib/libmca_common_verbs.so.7: undefined symbol: ompi_common_verbs_usnic_register_fake_drivers

all showing undefined symbol: ompi_common_verbs_usnic_register_fake_drivers.
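To check that a set of failures really shares one root cause, the "symbol lookup error" lines can be grouped by library and symbol. The following is only a minimal sketch: `summarize_symbol_errors` is a hypothetical helper name, not part of Trilinos or the ATDM scripts, and it assumes the dynamic loader's error lines have the same layout as the one quoted above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Trilinos): read test output on stdin and
# print each distinct "<library> <symbol>" pair found in "symbol lookup error"
# lines, prefixed by the number of times it occurs.
summarize_symbol_errors() {
  grep 'symbol lookup error:' \
    | sed -E 's/.*symbol lookup error: ([^:]+): undefined symbol: ([^ ]+).*/\1 \2/' \
    | sort | uniq -c | sed -E 's/^ +//'
}
```

If every failing test reports the same library and symbol, the output is a single line. Running `ldd -r` on the reported library on the affected machine is a complementary check, since it also reports unresolved symbols.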

To demonstrate this, I used a recent version of Trilinos 'develop' from ATDM Trilinos testing day 2021-12-21 55091be as shown by:

$ git log-short --name-status --graph 2be5702

*   2be5702 "Merge remote-tracking branch 'origin/develop' into atdm-nightly"
|\  Author: trilinos-autotester <trilinos@sandia.gov>
| | Date:   Fri Dec 10 21:05:07 2021 -0700 (4 days ago)
| |     
| *   55091be "Merge Pull Request #10005 from trilinos/Trilinos/csiefer-827cd90"
| |\  Author: trilinos-autotester <trilinos@sandia.gov>
| | | Date:   Fri Dec 10 13:04:17 2021 -0700 (4 days ago)

that was pretty clean as shown in this query.

To demonstrate the problem on my machine 'hpws055', I ran all of the supported ATDM Trilinos 'sems-rhel7' builds for just the Teuchos test suite:

$ env Trilinos_PACKAGES=Teuchos ./ctest-s-local-test-driver.sh all

***
*** ./ctest-s-local-test-driver.sh
***

ATDM_TRILINOS_DIR = '/fgs/rabartl/Trilinos.base2/Trilinos'

Load some env to get python, cmake, etc ...

Hostname 'hpws055.sandia.gov' matches known ATDM host 'hpws055.sandia.gov' and system 'sems-rhel7'
Setting compiler and build options for build-name 'default'
Using SEMS RHEL7 compiler stack GNU-7.2.0 to build DEBUG code with Kokkos node type SERIAL

Running builds:
    sems-rhel7-clang-7.0.1-openmp-shared-release-debug
    sems-rhel7-cuda-10.1-Volta70-complex-shared-release-debug
    sems-rhel7-gnu-7.2.0-openmp-complex-shared-release-debug
    sems-rhel7-intel-18.0.5-openmp-complex-shared-release-debug
    sems-rhel7-intel-18.0.5-openmp-shared-debug
    sems-rhel7-intel-18.0.5-openmp-shared-release-debug

Detailed output from ctest-s-local-test-driver.sh:

Tue Dec 14 09:40:26 MST 2021

Running Jenkins driver Trilinos-atdm-sems-rhel7-clang-7.0.1-openmp-shared-release-debug.sh ...

    Creating directory: Trilinos-atdm-sems-rhel7-clang-7.0.1-openmp-shared-release-debug

    Creating directory: SRC_AND_BUILD

    See log file Trilinos-atdm-sems-rhel7-clang-7.0.1-openmp-shared-release-debug/smart-jenkins-driver.out

real    2m27.954s
user    18m49.847s
sys     0m54.898s

14% tests passed, 126 tests failed out of 146

Tue Dec 14 09:42:54 MST 2021

Running Jenkins driver Trilinos-atdm-sems-rhel7-cuda-10.1-Volta70-complex-shared-release-debug.sh ...

    Creating directory: Trilinos-atdm-sems-rhel7-cuda-10.1-Volta70-complex-shared-release-debug

    Creating directory: SRC_AND_BUILD

    See log file Trilinos-atdm-sems-rhel7-cuda-10.1-Volta70-complex-shared-release-debug/smart-jenkins-driver.out

real    4m8.488s
user    28m55.797s
sys     3m33.693s

14% tests passed, 126 tests failed out of 146

Tue Dec 14 09:47:03 MST 2021

Running Jenkins driver Trilinos-atdm-sems-rhel7-gnu-7.2.0-openmp-complex-shared-release-debug.sh ...

    Creating directory: Trilinos-atdm-sems-rhel7-gnu-7.2.0-openmp-complex-shared-release-debug

    Creating directory: SRC_AND_BUILD

    See log file Trilinos-atdm-sems-rhel7-gnu-7.2.0-openmp-complex-shared-release-debug/smart-jenkins-driver.out

real    2m1.078s
user    16m13.491s
sys     1m11.505s

14% tests passed, 126 tests failed out of 146

Tue Dec 14 09:49:04 MST 2021

Running Jenkins driver Trilinos-atdm-sems-rhel7-intel-18.0.5-openmp-complex-shared-release-debug.sh ...

    Creating directory: Trilinos-atdm-sems-rhel7-intel-18.0.5-openmp-complex-shared-release-debug

    Creating directory: SRC_AND_BUILD

    See log file Trilinos-atdm-sems-rhel7-intel-18.0.5-openmp-complex-shared-release-debug/smart-jenkins-driver.out

real    6m24.248s
user    65m14.843s
sys     2m50.406s

14% tests passed, 126 tests failed out of 146

Tue Dec 14 09:55:28 MST 2021

Running Jenkins driver Trilinos-atdm-sems-rhel7-intel-18.0.5-openmp-shared-debug.sh ...

    Creating directory: Trilinos-atdm-sems-rhel7-intel-18.0.5-openmp-shared-debug

    Creating directory: SRC_AND_BUILD

    See log file Trilinos-atdm-sems-rhel7-intel-18.0.5-openmp-shared-debug/smart-jenkins-driver.out

real    1m55.680s
user    14m50.879s
sys     2m12.194s

14% tests passed, 126 tests failed out of 146

Tue Dec 14 09:57:24 MST 2021

Running Jenkins driver Trilinos-atdm-sems-rhel7-intel-18.0.5-openmp-shared-release-debug.sh ...

    Creating directory: Trilinos-atdm-sems-rhel7-intel-18.0.5-openmp-shared-release-debug

    Creating directory: SRC_AND_BUILD

    See log file Trilinos-atdm-sems-rhel7-intel-18.0.5-openmp-shared-release-debug/smart-jenkins-driver.out

real    6m17.964s
user    68m52.389s
sys     3m2.706s

14% tests passed, 126 tests failed out of 146

Tue Dec 14 10:03:42 MST 2021

Done running all of the builds!

which posted to CDash as shown here:

There are 126 failing tests for each build, 756 failing tests across all of these builds as shown here:

All of these failing tests show the error undefined symbol: ompi_common_verbs_usnic_register_fake_drivers as shown in this query:
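The per-build and overall failure counts quoted above can be recomputed from the driver output by totaling the ctest summary lines. This is a rough sketch only: `total_failed_tests` is a hypothetical helper name, and it assumes the summary lines have exactly the "N% tests passed, F tests failed out of T" form shown in the log above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Trilinos): read driver output on stdin and
# total the ctest summary lines of the form
#   "N% tests passed, F tests failed out of T"
total_failed_tests() {
  awk '/tests passed,/ && /tests failed out of/ {
         builds += 1; failed += $4; total += $9
       }
       END { printf "%d builds, %d failed out of %d tests\n", builds, failed, total }'
}
```

Fed the six summary lines from the 'hpws055' run, each reporting 126 failures out of 146 tests, this prints `6 builds, 756 failed out of 876 tests`, which matches the total given above.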

Internal issues:

@bartlettroscoe bartlettroscoe added type: bug The primary issue is a bug in Trilinos code or tests Framework tasks Framework tasks (used internally by Framework team) impacting: tests The defect (bug) is primarily a test failure (vs. a build failure) PA: Framework Issues that fall under the Trilinos Framework Product Area labels Dec 14, 2021
@bartlettroscoe
Member Author

@trilinos/framework, my understanding is that CEE is pushing everyone to move to these HPWS machines and is eliminating the RWS and EWS machines. I know the current ATDM Trilinos 'sems-rhel7' builds are running on 'ascicgpu' machines, but most Trilinos developers can't get access to those machines. (For example, I don't think I currently have access to any 'ascicgpu' machines.)

I need to have a working set of Trilinos builds in order to test TriBITS work associated with TriBITSPub/TriBITS#367. Otherwise, I just have to do my best to locally test and cross fingers and merge to Trilinos 'develop'. But I would rather run a more comprehensive set of builds before merging future TriBITS changes.

@e10harvey
Contributor

e10harvey commented Dec 14, 2021

@bartlettroscoe: I don't see this build error showing up in the PrimaryATDM or SecondaryATDM builds. I will raise this github issue during our meeting tomorrow.

CC: @jwillenbring, @ZUUL42

@bartlettroscoe
Member Author

> I don't see this build error showing up in the PrimaryATDM or SecondaryATDM builds. I will raise this github issue during our meeting tomorrow.

@e10harvey, I suspect that is because the builds posting to CDash are running on ASCIGPU machines, not HPWS machines. The latter are fairly new. See my comment above.

@bartlettroscoe
Member Author

FYI: As a basis of comparison, I ran the exact same Teuchos 'sems-rhel7' builds with the exact same version of Trilinos on the CEE build machine 'cee-build015' and I got all passing builds and tests (except for 3 failing tests from the CUDA build because this machine does not have a GPU). This shows the problem is with the SEMS modules and/or the ATDM Trilinos env and/or the HPWS machines themselves.

Details of the Teuchos 'sems-rhel7' builds on 'cee-build015' and the results on CDash:

I ran the full set of supported builds for Teuchos on the machine 'cee-build015' as follows:

$ ssh cee-build015

$ cd /scratch/rabartl/Trilinos.base/BUILDS/ATDM/SEMS-RHEL7/CTEST_S/

$ ln -s /scratch/rabartl/Trilinos.base/Trilinos/cmake/std/atdm/ctest-s-local-test-driver.sh .

$ env Trilinos_PACKAGES=Teuchos ./ctest-s-local-test-driver.sh all

***
*** ./ctest-s-local-test-driver.sh
***

ATDM_TRILINOS_DIR = '/scratch/rabartl/Trilinos.base/Trilinos'

Load some env to get python, cmake, etc ...

Hostname 'cee-build015' matches known ATDM host 'cee-build015' and system 'sems-rhel7'
Setting compiler and build options for build-name 'default'
Using SEMS RHEL7 compiler stack GNU-7.2.0 to build DEBUG code with Kokkos node type SERIAL

Running builds:
    sems-rhel7-clang-7.0.1-openmp-shared-release-debug
    sems-rhel7-cuda-10.1-Volta70-complex-shared-release-debug
    sems-rhel7-gnu-7.2.0-openmp-complex-shared-release-debug
    sems-rhel7-intel-18.0.5-openmp-complex-shared-release-debug
    sems-rhel7-intel-18.0.5-openmp-shared-debug
    sems-rhel7-intel-18.0.5-openmp-shared-release-debug

Tue Dec 14 12:08:29 MST 2021

Running Jenkins driver Trilinos-atdm-sems-rhel7-clang-7.0.1-openmp-shared-release-debug.sh ...

    Creating directory: Trilinos-atdm-sems-rhel7-clang-7.0.1-openmp-shared-release-debug

    Creating directory: SRC_AND_BUILD

    See log file Trilinos-atdm-sems-rhel7-clang-7.0.1-openmp-shared-release-debug/smart-jenkins-driver.out

real    3m27.990s
user    32m20.235s
sys     3m51.218s

100% tests passed, 0 tests failed out of 146

Tue Dec 14 12:11:57 MST 2021

Running Jenkins driver Trilinos-atdm-sems-rhel7-cuda-10.1-Volta70-complex-shared-release-debug.sh ...

    Creating directory: Trilinos-atdm-sems-rhel7-cuda-10.1-Volta70-complex-shared-release-debug

    Creating directory: SRC_AND_BUILD

    See log file Trilinos-atdm-sems-rhel7-cuda-10.1-Volta70-complex-shared-release-debug/smart-jenkins-driver.out

real    6m15.396s
user    47m11.151s
sys     9m37.938s

98% tests passed, 3 tests failed out of 146

Tue Dec 14 12:18:13 MST 2021

Running Jenkins driver Trilinos-atdm-sems-rhel7-gnu-7.2.0-openmp-complex-shared-release-debug.sh ...

    Creating directory: Trilinos-atdm-sems-rhel7-gnu-7.2.0-openmp-complex-shared-release-debug

    Creating directory: SRC_AND_BUILD

    See log file Trilinos-atdm-sems-rhel7-gnu-7.2.0-openmp-complex-shared-release-debug/smart-jenkins-driver.out

real    2m56.085s
user    25m14.042s
sys     4m56.947s

100% tests passed, 0 tests failed out of 146

Tue Dec 14 12:21:09 MST 2021

Running Jenkins driver Trilinos-atdm-sems-rhel7-intel-18.0.5-openmp-complex-shared-release-debug.sh ...

    Creating directory: Trilinos-atdm-sems-rhel7-intel-18.0.5-openmp-complex-shared-release-debug

    Creating directory: SRC_AND_BUILD

    See log file Trilinos-atdm-sems-rhel7-intel-18.0.5-openmp-complex-shared-release-debug/smart-jenkins-driver.out

real    7m40.898s
user    89m47.930s
sys     8m25.468s

100% tests passed, 0 tests failed out of 146

Tue Dec 14 12:28:50 MST 2021

Running Jenkins driver Trilinos-atdm-sems-rhel7-intel-18.0.5-openmp-shared-debug.sh ...

    Creating directory: Trilinos-atdm-sems-rhel7-intel-18.0.5-openmp-shared-debug

    Creating directory: SRC_AND_BUILD

    See log file Trilinos-atdm-sems-rhel7-intel-18.0.5-openmp-shared-debug/smart-jenkins-driver.out

real    2m38.976s
user    20m0.146s
sys     6m37.105s

100% tests passed, 0 tests failed out of 146

Tue Dec 14 12:31:29 MST 2021

Running Jenkins driver Trilinos-atdm-sems-rhel7-intel-18.0.5-openmp-shared-release-debug.sh ...

    Creating directory: Trilinos-atdm-sems-rhel7-intel-18.0.5-openmp-shared-release-debug

    Creating directory: SRC_AND_BUILD

    See log file Trilinos-atdm-sems-rhel7-intel-18.0.5-openmp-shared-release-debug/smart-jenkins-driver.out

real    7m23.097s
user    89m8.752s
sys     8m36.234s

100% tests passed, 0 tests failed out of 146

Tue Dec 14 12:38:52 MST 2021

Done running all of the builds!

posted to:

with only 3 failing tests in the CUDA build:

and all 3 fail because this machine does not have a GPU, showing the error:

terminate called after throwing an instance of 'std::runtime_error'
  what():  cudaGetDeviceCount(&m_cudaDevCount) error( cudaErrorInsufficientDriver): CUDA driver version is insufficient for CUDA runtime version /scratch/rabartl/Trilinos.base/BUILDS/ATDM/SEMS-RHEL7/CTEST_S/Trilinos-atdm-sems-rhel7-cuda-10.1-Volta70-complex-shared-release-debug/SRC_AND_BUILD/Trilinos/packages/kokkos/core/src/Cuda/Kokkos_Cuda_Instance.cpp:253

as shown in this query:

@bartlettroscoe
Member Author

FYI: The RHEL7 version is newer on the HPWS machine 'hpws055' than on the older CEE machine 'cee-build015'. The details are given in TRILINOSHD-59. It may just be that the SEMS modules are broken for the newer OS. Hopefully the newer SEMS modules built with Spack will work okay.

@e10harvey
Contributor

@fryeguy52: Would you please look into the module issues on the HPWS machine?

@bartlettroscoe
Member Author

FYI: I got a rude reminder this is still broken as documented in #10836 (comment).

@bartlettroscoe
Member Author

And I also hit this as described in #10823 (comment). I keep forgetting that the SEMS modules are broken on the CEE High Performance Work Station (HPWS) machines :-(

@bartlettroscoe
Member Author

It seems this same problem is impacting the new 'ascic0xy' machines as well. See #10999.

@bartlettroscoe bartlettroscoe changed the title ATDM Trilinos 'sems-rhel7' configuration broken on new CEE HPWS machines ATDM Trilinos 'sems-rhel7' configuration broken on new CEE hpwsXYZ and cee-buildXYZ machines Sep 27, 2022
@bartlettroscoe
Member Author

bartlettroscoe commented Sep 27, 2022

Well shoot, this same error is occurring on the 'cee-build030' machine as well :-(

I am trying to run full Trilinos PR builds on the beefy machine 'cee-build030' to test a CMake upgrade for #10355, and I can't successfully load a GenConfig env.

Specifically, given the env load script load-env.sh:

export TRILINOS_DIR=/fgs/rabartl/Trilinos.base/Trilinos
source ${TRILINOS_DIR}/packages/framework/GenConfig/gen-config.sh \
--cmake-fragment GenConfigSettings.cmake \
rhel7_sems-clang-10.0.0-openmpi-1.10.1-serial_release-debug_shared_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables \
--force -y \
"$@" \
${TRILINOS_DIR}

When I source it, it produces:

$ ssh cee-build030

$ cd /fgs/rabartl/Trilinos.base/BUILDS/PR/pr_builds/clang-10.0.0/

$ source load-env.sh
Using system 'rhel7' based on matching hostname 'cee-build030'.
Overriding system to 'rhel7' based on specification in build name 'rhel7_sems-clang-10.0.0-openmpi-1.10.1-serial_release-debug_shared_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables'.
Matched environment name 'sems-clang-10.0.0-openmpi-1.10.1-serial' in build name 'rhel7_sems-clang-10.0.0-openmpi-1.10.1-serial_release-debug_shared_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables'.
Matched complete configuration 'rhel7_sems-clang-10.0.0-openmpi-1.10.1-serial_release-debug_shared_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables'
  for build name 'rhel7_sems-clang-10.0.0-openmpi-1.10.1-serial_release-debug_shared_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables'.
* CMake fragment file written to: /fgs/rabartl/Trilinos.base/BUILDS/PR/pr_builds/clang-10.0.0/GenConfigSettings.cmake

Using system 'rhel7' based on matching hostname 'cee-build030'.
Overriding system to 'rhel7' based on specification in build name 'rhel7_sems-clang-10.0.0-openmpi-1.10.1-serial_release-debug_shared_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables'.
Matched environment name 'sems-clang-10.0.0-openmpi-1.10.1-serial' in build name 'rhel7_sems-clang-10.0.0-openmpi-1.10.1-serial_release-debug_shared_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables'.
Environment 'rhel7_sems-clang-10.0.0-openmpi-1.10.1-serial' validated.
which: no mpicc in (/projects/sems/install/rhel7-x86_64/sems/compiler/clang/10.0.0/base/bin:/projects/sems/install/rhel7-x86_64/sems/compiler/gcc/5.3.0/base/bin:/ascldap/users/rabartl/bin:/usr/lib64/qt-3.3/bin:/ascldap/users/rabartl/perl5/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/opt/puppetlabs/bin)
which: no mpicxx in (/projects/sems/install/rhel7-x86_64/sems/compiler/clang/10.0.0/base/bin:/projects/sems/install/rhel7-x86_64/sems/compiler/gcc/5.3.0/base/bin:/ascldap/users/rabartl/bin:/usr/lib64/qt-3.3/bin:/ascldap/users/rabartl/perl5/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/opt/puppetlabs/bin)
which: no mpif90 in (/projects/sems/install/rhel7-x86_64/sems/compiler/clang/10.0.0/base/bin:/projects/sems/install/rhel7-x86_64/sems/compiler/gcc/5.3.0/base/bin:/ascldap/users/rabartl/bin:/usr/lib64/qt-3.3/bin:/ascldap/users/rabartl/perl5/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/opt/puppetlabs/bin)



********************************************************************************
           E N V I R O N M E N T  L O A D E D  S U C E S S F U L L Y
********************************************************************************
...

And then when I try to configure and build Trilinos, the mpiexec command used by Trilinos produces errors like:

/fgs/rabartl/Trilinos.base/BUILDS/PR/pr_builds/clang-10.0.0/packages/teuchos/core/test/MemoryManagement/TeuchosCore_RCPNodeTracer_UnitTests.exe: symbol lookup error: /projects/sems/install/rhel7-x86_64/sems/compiler/clang/10.0.0/openmpi/1.10.1/lib/libmca_common_verbs.so.7: undefined symbol: ompi_common_verbs_usnic_register_fake_drivers

While this may be a problem with the SEMS modules, it seems that GenConfig can't figure out whether it loaded the env correctly or not.
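One way to catch this before configuring is to check, right after the env load, that the MPI compiler wrappers reported missing by `which` above are actually on `PATH`. The sketch below is an assumption about what such a check could look like; `check_mpi_wrappers` is a hypothetical function and is not part of GenConfig.

```shell
#!/usr/bin/env bash
# Hypothetical post-load check (not part of GenConfig): report every MPI
# compiler wrapper that cannot be found on PATH, and return non-zero if at
# least one is missing.
check_mpi_wrappers() {
  local missing=0 tool
  for tool in mpicc mpicxx mpif90; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

A possible use is `source load-env.sh && check_mpi_wrappers || echo "env load is incomplete"`, which would have flagged the session above even though the success banner was printed. Note that this only detects missing wrappers; it would not by itself detect the undefined-symbol failure at test run time.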

@bartlettroscoe
Member Author

See #10999 (comment)

@github-actions

This issue has had no activity for 365 days and is marked for closure. It will be closed after an additional 30 days of inactivity.
If you would like to keep this issue open please add a comment and/or remove the MARKED_FOR_CLOSURE label.
If this issue should be kept open even with no activity beyond the time limits you can add the label DO_NOT_AUTOCLOSE.
If it is ok for this issue to be closed, feel free to go ahead and close it. Please do not add any comments or change any labels or otherwise touch this issue unless your intention is to reset the inactivity counter for an additional year.

@github-actions github-actions bot added the MARKED_FOR_CLOSURE Issue or PR is marked for auto-closure by the GitHub Actions bot. label Oct 25, 2023

This issue was closed due to inactivity for 395 days.

@github-actions github-actions bot added the CLOSED_DUE_TO_INACTIVITY Issue or PR has been closed by the GitHub Actions bot due to inactivity. label Nov 25, 2023