Thyra: EpetraOperatorWrapper_UnitTests.cpp cannot open "Trilinos_Util_CrsMatrixGallery.h" blocking all PRs changing Tpetra #10842

Closed
ndellingwood opened this issue Aug 8, 2022 · 33 comments · Fixed by TriBITSPub/TriBITS#512
Assignees
Labels
TriBITS Issues with the TriBITS framework itself, not usage of the TriBITS framework type: bug The primary issue is a bug in Trilinos code or tests

Comments

@ndellingwood
Contributor

ndellingwood commented Aug 8, 2022

Bug Report

@trilinos/thyra

Internal issues:

Description

Compilation of Trilinos fails on SKX architecture (Serial and OpenMP backends, Blake testbed) in the configuration tested below with output message (including compilation line):

[ 94%] Building CXX object packages/thyra/adapters/epetra/test/EpetraOperatorWrapper/CMakeFiles/ThyraEpetraAdapters_EpetraOperatorWrapper_UnitTests.dir/EpetraOperatorWrapper_UnitTests.cpp.o
cd /ascldap/users/ndellin/trilinos/Trilinos-pristine/Build/ATDM-Blake-intel19-serial-thyra-mkl/packages/thyra/adapters/epetra/test/EpetraOperatorWrapper && /ascldap/users/projects/x86-64-skylake/openmpi/4.0.1/intel/19.3.199/bin/mpicxx  -I/ascldap/users/ndellin/trilinos/Trilinos-pristine/Build/ATDM-Blake-intel19-serial-thyra-mkl -I/ascldap/users/ndellin/trilinos/Trilinos-pristine/packages/thyra/adapters/epetra/src -I/ascldap/users/ndellin/trilinos/Trilinos-pristine/packages/thyra/core/src -I/ascldap/users/ndellin/trilinos/Trilinos-pristine/packages/thyra/core/src/interfaces/operator_vector/fundamental -I/ascldap/users/ndellin/trilinos/Trilinos-pristine/packages/thyra/core/src/interfaces/operator_vector/extended -I/ascldap/users/ndellin/trilinos/Trilinos-pristine/packages/thyra/core/src/support/operator_vector/client_support -I/ascldap/users/ndellin/trilinos/Trilinos-pristine/packages/thyra/core/src/support/operator_vector/adapter_support -I/ascldap/users/ndellin/trilinos/Trilinos-pristine/packages/thyra/core/src/interfaces/operator_solve/fundamental -I/ascldap/users/ndellin/trilinos/Trilinos-pristine/packages/thyra/core/src/interfaces/operator_solve/extended -I/ascldap/users/ndellin/trilinos/Trilinos-pristine/packages/thyra/core/src/support/operator_solve/client_support -I/ascldap/users/ndellin/trilinos/Trilinos-pristine/packages/thyra/core/src/interfaces/nonlinear/model_evaluator/fundamental -I/ascldap/users/ndellin/trilinos/Trilinos-pristine/packages/thyra/core/src/support/nonlinear/model_evaluator/client_support -I/ascldap/users/ndellin/trilinos/Trilinos-pristine/packages/thyra/core/src/interfaces/nonlinear/solvers/fundamental -I/ascldap/users/ndellin/trilinos/Trilinos-pristine/packages/thyra/core/src/support/nonlinear/solvers/client_support -I/ascldap/users/ndellin/trilinos/Trilinos-pristine/Build/ATDM-Blake-intel19-serial-thyra-mkl/packages/thyra/core/src -I/ascldap/users/ndellin/trilinos/Trilinos-pristine/packages/thyra/core/example/operator_vector 
-I/ascldap/users/ndellin/trilinos/Trilinos-pristine/Build/ATDM-Blake-intel19-serial-thyra-mkl/packages/teuchos/core/src -I/ascldap/users/ndellin/trilinos/Trilinos-pristine/packages/teuchos/core/src -I/ascldap/users/ndellin/trilinos/Trilinos-pristine/Build/ATDM-Blake-intel19-serial-thyra-mkl/packages/kokkos/core/src -I/ascldap/users/ndellin/trilinos/Trilinos-pristine/packages/kokkos/core/src -I/ascldap/users/ndellin/trilinos/Trilinos-pristine/Build/ATDM-Blake-intel19-serial-thyra-mkl/packages/kokkos -I/ascldap/users/ndellin/trilinos/Trilinos-pristine/packages/teuchos/parameterlist/src -I/ascldap/users/ndellin/trilinos/Trilinos-pristine/packages/teuchos/parser/src -I/ascldap/users/ndellin/trilinos/Trilinos-pristine/packages/teuchos/comm/src -I/ascldap/users/ndellin/trilinos/Trilinos-pristine/packages/teuchos/numerics/src -I/ascldap/users/ndellin/trilinos/Trilinos-pristine/packages/rtop/src -I/ascldap/users/ndellin/trilinos/Trilinos-pristine/packages/rtop/src/interfaces -I/ascldap/users/ndellin/trilinos/Trilinos-pristine/packages/rtop/src/support -I/ascldap/users/ndellin/trilinos/Trilinos-pristine/packages/rtop/src/ops_lib -I/ascldap/users/ndellin/trilinos/Trilinos-pristine/packages/rtop/src/lapack -I/ascldap/users/ndellin/trilinos/Trilinos-pristine/Build/ATDM-Blake-intel19-serial-thyra-mkl/packages/rtop/src -I/ascldap/users/ndellin/trilinos/Trilinos-pristine/Build/ATDM-Blake-intel19-serial-thyra-mkl/packages/epetra/src -I/ascldap/users/ndellin/trilinos/Trilinos-pristine/packages/epetra/src -I/ascldap/users/ndellin/trilinos/Trilinos-pristine/packages/teuchos/remainder/src -I/ascldap/users/ndellin/trilinos/Trilinos-pristine/Build/ATDM-Blake-intel19-serial-thyra-mkl/packages/teuchos/remainder/src -I/ascldap/users/ndellin/trilinos/Trilinos-pristine/Build/ATDM-Blake-intel19-serial-thyra-mkl/packages/teuchos/kokkoscompat/src -I/ascldap/users/ndellin/trilinos/Trilinos-pristine/packages/teuchos/kokkoscompat/src 
-I/ascldap/users/ndellin/trilinos/Trilinos-pristine/Build/ATDM-Blake-intel19-serial-thyra-mkl/packages/teuchos/kokkoscomm/src -I/ascldap/users/ndellin/trilinos/Trilinos-pristine/packages/teuchos/kokkoscomm/src -isystem /ascldap/users/projects/x86-64-skylake/boost/1.65.1/intel/19.3.199/include -g -xCORE-AVX512  -O3 -DNDEBUG -std=c++14 -o CMakeFiles/ThyraEpetraAdapters_EpetraOperatorWrapper_UnitTests.dir/EpetraOperatorWrapper_UnitTests.cpp.o -c /ascldap/users/ndellin/trilinos/Trilinos-pristine/packages/thyra/adapters/epetra/test/EpetraOperatorWrapper/EpetraOperatorWrapper_UnitTests.cpp
/ascldap/users/ndellin/trilinos/Trilinos-pristine/packages/thyra/adapters/epetra/test/EpetraOperatorWrapper/EpetraOperatorWrapper_UnitTests.cpp(54): catastrophic error: cannot open source file "Trilinos_Util_CrsMatrixGallery.h"
  #include "Trilinos_Util_CrsMatrixGallery.h"
                                             ^

compilation aborted for /ascldap/users/ndellin/trilinos/Trilinos-pristine/packages/thyra/adapters/epetra/test/EpetraOperatorWrapper/EpetraOperatorWrapper_UnitTests.cpp (code 4)

I haven't yet attempted a bisection to determine the PR or SHA where the breakage began.

Steps to Reproduce

  1. SHA1: fa6023d
  2. Configure script: reproduced below from my script (Blake testbed); it disables all tests except Thyra's, contains boilerplate for the TPLs, and explicitly enables the serial backend for Kokkos and Tpetra
# Load modules/env
module load devpack/20190329/openmpi/4.0.1/intel/19.3.199
module swap cmake/3.12.3 cmake/3.19.3

export MKL_FLAG="-mkl"
export BLAS_LIBRARIES="-mkl;${MKLROOT}/lib/intel64/libmkl_intel_lp64.a;${MKLROOT}/lib/intel64/libmkl_intel_thread.a;${MKLROOT}/lib/intel64/libmkl_core.a"
export LAPACK_LIBRARIES=${BLAS_LIBRARIES}

# Top Level Configuration Options
TESTS=OFF
EXAMPLES=OFF
SHARED=ON

CUDA=OFF
OPENMP=OFF
PTHREAD=OFF
SERIAL=ON
COMPLEX=ON

# Configure
cmake \
-DCMAKE_INSTALL_PREFIX="${TRILINOS_INSTALL_DIR}" \
-DCMAKE_CXX_STANDARD="14" \
-D Trilinos_ENABLE_COMPLEX_DOUBLE=${COMPLEX} \
\
-D Kokkos_ARCH_SKX=ON \
-D CMAKE_CXX_FLAGS="-g" \
-D CMAKE_C_FLAGS="${MKL_FLAG} -g" \
-D CMAKE_Fortran_FLAGS="${MKL_FLAG} -g" \
-D CMAKE_EXE_LINKER_FLAGS="${MKL_FLAG}" \
-D CMAKE_Fortran_COMPILER="mpif77" \
-D Trilinos_ENABLE_EXPLICIT_INSTANTIATION:BOOL=ON \
-D Trilinos_ENABLE_DEBUG:BOOL=OFF \
\
-D Trilinos_ENABLE_INSTALL_CMAKE_CONFIG_FILES:BOOL=ON \
-D CMAKE_BUILD_TYPE:STRING=RELEASE \
-D CMAKE_VERBOSE_MAKEFILE:BOOL=OFF \
-D CMAKE_SKIP_RULE_DEPENDENCY=ON \
-D Trilinos_ENABLE_ALL_PACKAGES:BOOL=OFF \
-D Trilinos_ENABLE_ALL_OPTIONAL_PACKAGES:BOOL=OFF \
-D BUILD_SHARED_LIBS:BOOL=${SHARED} \
-D DART_TESTING_TIMEOUT:STRING=500 \
\
-D Trilinos_ENABLE_OpenMP=${OPENMP} \
-D TPL_ENABLE_CUDA=${CUDA} \
-D TPL_ENABLE_MPI=ON \
  -D MPI_EXEC_POST_NUMPROCS_FLAGS:STRING="-bind-to;socket;-map-by;socket" \
-D TPL_ENABLE_BLAS=ON \
  -D TPL_BLAS_LIBRARIES:PATH="${BLAS_LIBRARIES}" \
-D TPL_ENABLE_LAPACK=ON \
  -D TPL_LAPACK_LIBRARIES:PATH="${LAPACK_LIBRARIES}" \
-D TPL_ENABLE_Boost=ON \
   -D Boost_INCLUDE_DIRS:PATH="${BOOST_ROOT}/include" \
   -D Boost_LIBRARY_DIRS:PATH="${BOOST_ROOT}/lib" \
-D TPL_ENABLE_BoostLib=ON \
   -D BoostLib_INCLUDE_DIRS:PATH="${BOOST_ROOT}/include" \
   -D BoostLib_LIBRARY_DIRS:PATH="${BOOST_ROOT}/lib" \
-D TPL_ENABLE_Netcdf=ON \
  -D Netcdf_INCLUDE_DIRS:PATH="${NETCDF_ROOT}/include" \
  -D Netcdf_LIBRARY_DIRS:PATH="${NETCDF_ROOT}/lib" \
  -D TPL_Netcdf_LIBRARIES:PATH="${NETCDF_ROOT}/lib/libnetcdf.a;${HDF5_ROOT}/lib/libhdf5_hl.a;${HDF5_ROOT}/lib/libhdf5.a;${ZLIB_ROOT}/lib/libz.a;${PNETCDF_ROOT}/lib/libpnetcdf.a" \
  -D TPL_Netcdf_PARALLEL:BOOL=ON \
-D TPL_ENABLE_HDF5=ON \
  -D HDF5_INCLUDE_DIRS:PATH="${HDF5_ROOT}/include" \
  -D TPL_HDF5_LIBRARIES:PATH="${HDF5_ROOT}/lib/libhdf5_hl.a;${HDF5_ROOT}/lib/libhdf5.a;${ZLIB_ROOT}/lib/libz.a" \
-D TPL_ENABLE_Zlib=ON \
  -D Zlib_INCLUDE_DIRS:PATH="${ZLIB_ROOT}/include" \
  -D TPL_Zlib_LIBRARIES:PATH="${ZLIB_ROOT}/lib/libz.a" \
-D TPL_ENABLE_DLlib=ON \
\
-D Trilinos_ENABLE_TESTS=${TESTS} \
-D Trilinos_ENABLE_Kokkos=ON \
  -D Kokkos_ENABLE_SERIAL=${SERIAL} \
  -D Kokkos_ENABLE_PTHREAD=${PTHREAD} \
  -D Kokkos_ENABLE_OPENMP=${OPENMP} \
  -D Kokkos_ENABLE_CUDA=${CUDA} \
  -D Kokkos_ENABLE_CUDA_LAMBDA=${CUDA} \
-D Trilinos_ENABLE_Tpetra=ON \
  -D Tpetra_INST_SERIAL:BOOL=${SERIAL} \
  -D Tpetra_INST_OPENMP:BOOL=${OPENMP} \
  -D Tpetra_INST_PTHREAD:BOOL=${PTHREAD} \
  -D Tpetra_INST_CUDA:BOOL=${CUDA} \
  -D Tpetra_ENABLE_EXAMPLES:BOOL=${EXAMPLES} \
-D Trilinos_ENABLE_Thyra=ON \
  -D Thyra_ENABLE_TESTS:BOOL=ON \
$TRILINOS_DIR
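
Since the failing test includes a Triutils header, one quick check after configuring is whether Triutils actually ended up enabled in the build tree. A minimal sketch, assuming the standard CMake cache format (`NAME:TYPE=VALUE`) and the TriBITS convention of `Trilinos_ENABLE_<Package>` cache variables; the helper name `cache_value` is made up for illustration:

```shell
# Hypothetical helper: print the value of a CMake cache variable.
# Usage: cache_value <path/to/CMakeCache.txt> <VARIABLE_NAME>
cache_value() {
  cache_file=$1
  var_name=$2
  # Cache entries look like NAME:TYPE=VALUE; strip the NAME:TYPE= prefix
  # and print what is left.
  sed -n "s/^${var_name}:[A-Za-z]*=//p" "$cache_file"
}

# Example (run from the build directory):
#   cache_value CMakeCache.txt Trilinos_ENABLE_Triutils
#   cache_value CMakeCache.txt Thyra_ENABLE_TESTS
```

An empty result means the variable is not in the cache at all, which is itself a useful data point.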

Cross-referencing #10823 (comment) where similar issues were mentioned

@ndellingwood ndellingwood added type: bug The primary issue is a bug in Trilinos code or tests pkg: Thyra Issues primarily dealing with the Thyra Package labels Aug 8, 2022
@bartlettroscoe
Member

That is very interesting. That error is also being reported in some PR builds as described in #10823 (comment) and #10823 (comment). But when I tried to reproduce the error in #10823 (comment) and #10823 (comment) I could not with the tip of 'develop'.

Can you try using a more recent version of 'develop' and see what happens?

In the meantime, I will see if I can reproduce this error on 'vortex' (where the PR tester is reporting an error in some PR builds).

@ndellingwood
Contributor Author

Can you try using a more recent version of 'develop' and see what happens?

@bartlettroscoe yes, prepping the build now

@bartlettroscoe
Member

I currently don't have access to 'blake' so I can't reproduce this failure. I will try on 'vortex' as per above.

@ndellingwood
Contributor Author

@bartlettroscoe yes, prepping the build now

@bartlettroscoe oops, I have to hold off a bit: I'm getting a "no space left on device" message on blake. Will launch the build once disk space is cleared (I contacted an admin for help with this)

@bartlettroscoe bartlettroscoe changed the title Thyra: catastrophic error: cannot open source file "Trilinos_Util_CrsMatrixGallery.h" Thyra: EpetraOperatorWrapper_UnitTests.cpp missing "Trilinos_Util_CrsMatrixGallery.h" Aug 8, 2022
@bartlettroscoe bartlettroscoe self-assigned this Aug 8, 2022
@ndellingwood
Contributor Author

Can "catastrophic error" type issues be triggered when disk space is low? If so, I suppose this could be a potential culprit for the intermittent issues with PR testing if the failures occur consistently on the same node(s)?

@bartlettroscoe bartlettroscoe added this to ToDo in Trilinos TriBITS Refactor via automation Aug 8, 2022
@bartlettroscoe bartlettroscoe moved this from ToDo to In Progress in Trilinos TriBITS Refactor Aug 8, 2022
@bartlettroscoe
Member

Can "catastrophic error" type issues be triggered when disk space is low? If so, I suppose this could be a potential culprit for the intermittent issues with PR testing if the failures occur consistently on the same node(s)?

But why would it report a specific header file being missing? When you run out of disk space (with no swap space) usually the compiler aborts with no diagnostic feedback.

@ndellingwood
Contributor Author

But why would it report a specific header file being missing? When you run out of disk space (with no swap space) usually the compiler aborts with no diagnostic feedback.

Yeah, I suppose it wouldn't make sense to report a specific header file being missing (I had forgotten the typical abort behavior when out of space)

@bartlettroscoe
Member

Can "catastrophic error" type issues be triggered when disk space is low? If so, I suppose this could be a potential culprit for the intermittent issues with PR testing if the failures occur consistently on the same node(s)?

@ndellingwood, if you can rerun that exact same build for the same repo version on 'blake' without running out of disk space, then that would be very good to know. That might explain why so many PR builds are failing with that error.

But at face value, that error makes no sense. Those files have not changed in a long time as shown by the git commands:

$ git log-short --name-status -- packages/thyra/adapters/epetra/test/EpetraOperatorWrapper/EpetraOperatorWrapper_UnitTests.cpp

b1be43979e6 "Revert "Merge branch 'master' of software.sandia.gov:/space/git/Trilinos""
Author: Brent Perschbacher <bmpersc@sandia.gov>
Date:   Tue Sep 22 15:53:17 2015 -0600 (7 years ago)

M       packages/thyra/adapters/epetra/test/EpetraOperatorWrapper/EpetraOperatorWrapper_UnitTests.cpp

...

and

$ git log-short --name-status packages/triutils/src/Trilinos_Util_CrsMatrixGallery.h

b1be43979e6 "Revert "Merge branch 'master' of software.sandia.gov:/space/git/Trilinos""
Author: Brent Perschbacher <bmpersc@sandia.gov>
Date:   Tue Sep 22 15:53:17 2015 -0600 (7 years ago)

M       packages/triutils/src/Trilinos_Util_CrsMatrixGallery.h

So those files have not changed or moved in 7 years.

The only recent changes to Thyra or TriUtils are:

$ git log-short --name-status -- packages/thyra packages/triutils/

3c6f63e82b2 "Thyra: Use Teuchos vars to consistent ETI including for float and complex<float> (#10635, #362)"
Author: Roscoe A. Bartlett <rabartl@sandia.gov>
Date:   Thu Jul 7 09:09:52 2022 -0600 (4 weeks ago)

M       packages/thyra/CMakeLists.txt
M       packages/thyra/core/cmake/Thyra_Config.h.in
M       packages/thyra/core/example/operator_vector/exampleImplicitlyComposedLinearOperators.cpp
M       packages/thyra/core/example/operator_vector/sillyCgSolve_mpi.cpp
M       packages/thyra/core/example/operator_vector/sillyCgSolve_serial.cpp
M       packages/thyra/core/example/operator_vector/sillyPowerMethod_serial.cpp
M       packages/thyra/core/src/support/operator_vector/client_support/Thyra_UnitTestHelpers.hpp
M       packages/thyra/core/test/operator_solve/test_linear_op_with_solve.cpp
M       packages/thyra/core/test/operator_vector/test_composite_linear_ops.cpp
M       packages/thyra/core/test/operator_vector/test_product_space.cpp
M       packages/thyra/core/test/operator_vector/test_scalar_product.cpp
M       packages/thyra/core/test/operator_vector/test_std_ops.cpp

c6d9ca891ed "Remove ifdefs for SUN_CXX (#10636)"
Author: Roscoe A. Bartlett <rabartl@sandia.gov>
Date:   Thu Jul 7 06:31:57 2022 -0600 (5 weeks ago)

M       packages/thyra/adapters/epetra/test/test_epetra_adapters.cpp
M       packages/thyra/core/test/operator_vector/test_scalar_product.cpp

Very unlikely for those changes to impact the compile of the file thyra/adapters/epetra/test/EpetraOperatorWrapper/EpetraOperatorWrapper_UnitTests.cpp.

Also, PR builds passed when those changes were merged to 'develop'.

@bartlettroscoe
Member

CC: @csiefer2, @jhux2

Note the matching internal issue TRILINOSHD-150.

@ndellingwood
Contributor Author

if you can rerun that exact same build for the same repo version on 'blake' without running out of disk space, then that would be very good to know

@bartlettroscoe I reran the build a couple times, first was with -j20 to reproduce the error, then again with -j1 to see if that made a difference before posting the issue; in both cases the "catastrophic error" occurred.

I'll rebuild and update after the system's "No space left on device" issue is resolved (I encountered the "No space left on device" message when running git pull to update my develop branch)

@jjellio
Contributor

jjellio commented Aug 8, 2022

May want to check that /tmp isn't full (that can masquerade as a disk-full error). Often you can't clean up /tmp yourself, though, because the files may be owned by others.

Can also look at df -h
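
The checks suggested here can be sketched as a small script. This assumes a POSIX `df` (the `-P` flag prints one line per filesystem with the use percentage in the fifth column); the helper name `usage_pct` and the 95% threshold are arbitrary choices for illustration:

```shell
# Hypothetical helper: print the use percentage (an integer) of the
# filesystem that holds the given directory.
usage_pct() {
  df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Report the filesystems a build typically touches: /tmp, the home
# directory, and the current (build) directory.
for dir in /tmp "${HOME:-/}" .; do
  pct=$(usage_pct "$dir")
  if [ "${pct:-0}" -ge 95 ]; then
    echo "WARNING: $dir is ${pct}% full"
  else
    echo "$dir is ${pct}% full"
  fi
done
```

Running it from the build directory covers the case where the source tree, the build tree, and /tmp sit on different filesystems.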

@bartlettroscoe
Member

bartlettroscoe commented Aug 8, 2022

I gave my best attempt to reproduce the build error:

<trilinosDir>/packages/thyra/adapters/epetra/test/EpetraOperatorWrapper/EpetraOperatorWrapper_UnitTests.cpp(54): catastrophic error: cannot open source file "Trilinos_Util_CrsMatrixGallery.h"
  #include "Trilinos_Util_CrsMatrixGallery.h"
                                             ^

for the PR build:

  • ats2_cuda-10.1.243-gnu-8.3.1-spmpi-rolling_release_static_Volta70_Power9_no-asan_no-complex_no-fpic_mpi_pt_no-rdc_uvm_deprecated-on_no-package-enables

on 'vortex', which is being consistently reported in PR #10808 (shown here), with the exact versions claimed in the last few PR iterations (here and here), which are:

  • 10807-kokkos-kernels-cublas: 48a70bc from Fri Jul 29 06:00:09 2022 -0600
  • develop: e8a9b49 from Thu Jul 28 08:57:51 2022 -0600

and I got a successful build of the executable and ran the test successfully

$ cd packages/thyra/adapters/epetra/test/EpetraOperatorWrapper/

$ make clean

$ time make NP=32  ThyraEpetraAdapters_EpetraOperatorWrapper_UnitTests
...

[340/340] Linking CXX executable packages/thyra/adapters/epetra/test/EpetraOperatorWrapper/ThyraEpetraAdapters_EpetraOperatorWrapper_UnitTests.exe

real    2m18.149s
user    47m11.417s
sys     12m22.643s

$ ls -l ThyraEpetraAdapters_EpetraOperatorWrapper_UnitTests.exe 
-rwxrwxr-x 1 rabartl rabartl 7220120 Aug  8 16:20 ThyraEpetraAdapters_EpetraOperatorWrapper_UnitTests.exe

There is just nothing more that I can do to try to reproduce this failure.

Attempt to reproduce build error on 'vortex' reported in PR #10808 (failed)

Trying to exactly reproduce the build error for Thyra EpetraOperatorWrapper_UnitTests.cpp missing "Trilinos_Util_CrsMatrixGallery.h" on 'vortex' for the PR #10808 that consistently reports the failure:

<trilinosDir>/packages/thyra/adapters/epetra/test/EpetraOperatorWrapper/EpetraOperatorWrapper_UnitTests.cpp(54): catastrophic error: cannot open source file "Trilinos_Util_CrsMatrixGallery.h"
  #include "Trilinos_Util_CrsMatrixGallery.h"
                                             ^

for the build:

  • ats2_cuda-10.1.243-gnu-8.3.1-spmpi-rolling_release_static_Volta70_Power9_no-asan_no-complex_no-fpic_mpi_pt_no-rdc_uvm_deprecated-on_no-package-enables

on 'vortex' shown here.

The last two PR iterations in PR #10808 here and here reported the exact same commits for the two branches being merged together locally:

  • 10807-kokkos-kernels-cublas: 48a70bc from Fri Jul 29 06:00:09 2022 -0600
  • develop: e8a9b49 from Thu Jul 28 08:57:51 2022 -0600

So let's create a temp branch that merges those exact two commits together and see if I can reproduce the build error.

So on 'vortex' I do:

$ ssh vortex

$ cd ~/Trilinos.base/Trilinos/

$ git fetch github

$ git checkout -b 10807-kokkos-kernels-cublas-pr-build 48a70bc056c
Switched to a new branch '10807-kokkos-kernels-cublas-pr-build'

$ git merge e8a9b49e567
Already up to date.

Interesting: "Already up to date" means the develop commit e8a9b49 is already an ancestor of the branch commit 48a70bc, so no merge was needed.
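
Ancestry between two commits can also be checked directly rather than inferred from the merge output. A minimal sketch using `git merge-base --is-ancestor`, which exits with status 0 when the first commit is an ancestor of the second; the wrapper name `describe_ancestry` is made up for illustration:

```shell
# Hypothetical wrapper: report how two commits are related.
# Usage: describe_ancestry <commit-a> <commit-b>
describe_ancestry() {
  if git merge-base --is-ancestor "$1" "$2"; then
    echo "$1 is an ancestor of $2"
  elif git merge-base --is-ancestor "$2" "$1"; then
    echo "$2 is an ancestor of $1"
  else
    echo "$1 and $2 have diverged"
  fi
}

# Example with the two commits from this build:
#   describe_ancestry e8a9b49e567 48a70bc056c
```

Only the "diverged" case would require an actual merge commit to reproduce what the PR tester builds.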

Now let's try and run a build of Thyra:

$ ssh vortex

$ cd /vscratch1/rabartl/Trilinos.base/BUILDS/PR/ats2_cuda-10.1.243-gnu-8.3.1-spmpi-rolling_release/

$ cat load-env-and-cmake-frag-file.sh
export TRILINOS_DIR=$HOME/Trilinos.base/Trilinos # Must be set!
if [[ -e GenConfigSettings.cmake ]] ; then
  echo "Removing existing file GenConfigSettings.cmake ..."
  rm GenConfigSettings.cmake
fi
source $TRILINOS_DIR/packages/framework/GenConfig/gen-config.sh \
--cmake-fragment GenConfigSettings.cmake \
ats2_cuda-10.1.243-gnu-8.3.1-spmpi-rolling_release_static_Volta70_Power9_no-asan_no-complex_no-fpic_mpi_pt_no-rdc_uvm_deprecated-on_no-package-enables \
--force \
"$@"

$ cat do-configure
if [[ -e CMakeCache.txt ]] ; then
  echo "Removing CMakeCache.txt ..."
  rm CMakeCache.txt
fi
if [[ -d CMakeFiles ]] ; then
  echo "Removing CMakeFiles ..."
  rm -r CMakeFiles
fi
cmake \
-G Ninja \
-C GenConfigSettings.cmake \
-D Trilinos_ENABLE_ALL_FORWARD_DEP_PACKAGES=OFF \
-D Trilinos_ENABLE_TESTS=ON \
-D Trilinos_TRACE_ADD_TEST=ON \
"$@" \
$TRILINOS_DIR

$ script load-env-and-cmake-frag-file.out

$ . load-env-and-cmake-frag-file.sh
Using system 'ats2' based on matching hostname 'vortex60'.
Overriding system to 'ats2' based on specification in build name 'ats2_cuda-10.1.243-gnu-8.3.1-spmpi-rolling_release_static_Volta70_Power9_no-asan_no-complex_no-fpic_mpi_pt_no-rdc_uvm_deprecated-on_no-package-enables'.
Matched environment name 'cuda-10.1.243-gnu-8.3.1-spmpi-rolling' in build name 'ats2_cuda-10.1.243-gnu-8.3.1-spmpi-rolling_release_static_Volta70_Power9_no-asan_no-complex_no-fpic_mpi_pt_no-rdc_uvm_deprecated-on_no-package-enables'.
Matched complete configuration 'ats2_cuda-10.1.243-gnu-8.3.1-spmpi-rolling_release_static_Volta70_Power9_no-asan_no-complex_no-fpic_mpi_pt_no-rdc_uvm_deprecated-on_no-package-enables'
  for build name 'ats2_cuda-10.1.243-gnu-8.3.1-spmpi-rolling_release_static_Volta70_Power9_no-asan_no-complex_no-fpic_mpi_pt_no-rdc_uvm_deprecated-on_no-package-enables'.
* CMake fragment file written to: /vscratch1/rabartl/Trilinos.base/BUILDS/PR/ats2_cuda-10.1.243-gnu-8.3.1-spmpi-rolling_release/GenConfigSettings.cmake
...

$ module list

Currently Loaded Modules:
  1) StdEnv                        (S)   4) sparc-tools/tools/main        7) cuda/10.1.243                                    10) gcc/8.3.1
  2) sparc-tools/python/3.7.9            5) sparc-tools/taos/2020.09.04   8) lapack/3.8.0-gcc-4.9.3                           11) spectrum-mpi/rolling-release
  3) sparc-tools/exodus/2021.11.26       6) bsub-wrapper/1.0              9) sparc-dev/cuda-10.1.243_gcc-7.3.1_spmpi-rolling  12) cmake/3.18.0

  Where:
   S:  Module is Sticky, requires --force to unload or purge

$ time ./do-configure -DTrilinos_ENABLE_Thyra=ON &> configure.out

real    1m36.561s
user    0m53.608s
sys     0m44.091s

$ tail configure.out 
...
Total time to configure Trilinos: 1m24.047s
-- Configuring done
-- Generating done
-- Build files have been written to: /vscratch1/rabartl/Trilinos.base/BUILDS/PR/ats2_cuda-10.1.243-gnu-8.3.1-spmpi-rolling_release

$ cd packages/thyra/adapters/epetra/test/EpetraOperatorWrapper/

$ make clean

$ time make NP=32  ThyraEpetraAdapters_EpetraOperatorWrapper_UnitTests
...

[340/340] Linking CXX executable packages/thyra/adapters/epetra/test/EpetraOperatorWrapper/ThyraEpetraAdapters_EpetraOperatorWrapper_UnitTests.exe

real    2m18.149s
user    47m11.417s
sys     12m22.643s

$ ls -l ThyraEpetraAdapters_EpetraOperatorWrapper_UnitTests.exe 
-rwxrwxr-x 1 rabartl rabartl 7220120 Aug  8 16:20 ThyraEpetraAdapters_EpetraOperatorWrapper_UnitTests.exe

And I was able to run this with:

$ bsub -J pr-build -W 6:00 -Is bash

$ ctest -VV -R ThyraEpetraAdapters_EpetraOperatorWrapper_UnitTests_MPI_4
...
1: Teuchos::GlobalMPISession::GlobalMPISession(): started processor with name vortex3 and rank 0!
1: Teuchos::GlobalMPISession::GlobalMPISession(): started processor with name vortex3 and rank 1!
1: Teuchos::GlobalMPISession::GlobalMPISession(): started processor with name vortex3 and rank 3!
1: Teuchos::GlobalMPISession::GlobalMPISession(): started processor with name vortex3 and rank 2!
1: 
1: ***
1: *** Unit test suite ...
1: ***
1: 
1: 
1: Sorting tests by group name then by the order they were added ... (time = 5.15e-07)
1: 
1: Running unit tests ...
1: 
1: 0. EpetraOperatorWrapper_basic_UnitTest ... [Passed] (0.0355 sec)
1: 
1: Total Time: 0.0355 sec
1: 
1: Summary: total = 1, run = 1, passed = 1, failed = 0
1: 
1: End Result: TEST PASSED
1: jsrun return value: 0
1/1 Test #1: ThyraEpetraAdapters_EpetraOperatorWrapper_UnitTests_MPI_4 ...   Passed    1.44 sec

The following tests passed:
        ThyraEpetraAdapters_EpetraOperatorWrapper_UnitTests_MPI_4

100% tests passed, 0 tests failed out of 1

Label Time Summary:
Thyra    =   5.78 sec*proc (1 test)

Total Test time (real) =   1.45 sec

So the build and test passed.

Darn, that did not reproduce the failure either.

@bartlettroscoe bartlettroscoe changed the title Thyra: EpetraOperatorWrapper_UnitTests.cpp missing "Trilinos_Util_CrsMatrixGallery.h" Thyra: EpetraOperatorWrapper_UnitTests.cpp can't open "Trilinos_Util_CrsMatrixGallery.h" Aug 8, 2022
@bartlettroscoe
Member

FYI: I just realized that the error:

<trilinosDir>/packages/thyra/adapters/epetra/test/EpetraOperatorWrapper/EpetraOperatorWrapper_UnitTests.cpp(54): catastrophic error: cannot open source file "Trilinos_Util_CrsMatrixGallery.h"
  #include "Trilinos_Util_CrsMatrixGallery.h"
                                             ^

says that it cannot open the file, not that it can't find it.

Perhaps a /tmp write issue might impact that?

I am just out of ideas.
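
One way to separate "not on the include path" from "present but unreadable" is to look for the header in each include directory by hand. A minimal sketch; the function name `probe_header` is hypothetical, and the directories to pass are whatever `-I` paths appear on the failing compile line:

```shell
# Hypothetical helper: for each include directory, report whether the
# header is missing, present but unreadable, or readable.
# Usage: probe_header <header-name> <dir>...
probe_header() {
  header=$1
  shift
  for dir in "$@"; do
    if [ ! -e "$dir/$header" ]; then
      echo "missing: $dir/$header"
    elif [ ! -r "$dir/$header" ]; then
      echo "unreadable: $dir/$header"
    else
      echo "readable: $dir/$header"
    fi
  done
}

# Example (paths relative to the Trilinos source tree):
#   probe_header Trilinos_Util_CrsMatrixGallery.h packages/triutils/src packages/epetra/src
```

If the header is reported as missing in every directory on the compile line but readable in its own source directory, the problem is in the include path rather than in the filesystem.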

@ndellingwood
Contributor Author

@jjellio thanks for the suggestion, /tmp is pretty empty but /home is quite low on space (96% full)

@bartlettroscoe thanks for the update and triage, I'll update tomorrow when I can rebuild with more disk space, hopefully I'll have some useful additional diagnostic info to share

@bartlettroscoe bartlettroscoe changed the title Thyra: EpetraOperatorWrapper_UnitTests.cpp can't open "Trilinos_Util_CrsMatrixGallery.h" Thyra: EpetraOperatorWrapper_UnitTests.cpp cannot open "Trilinos_Util_CrsMatrixGallery.h" Aug 8, 2022
@bartlettroscoe
Member

Note my other two failed attempts to reproduce this build error with other compilers and other configurations in #10823 (comment) and #10823 (comment) as reported on CDash (see here).

@ndellingwood
Contributor Author

Same failure with builds I launched on Blake overnight :(

@bartlettroscoe
Member

Same failure with builds I launched on Blake overnight :(

Okay, I will request an account on 'blake' and I will try to reproduce there once I get access.

@jhux2
Member

jhux2 commented Aug 10, 2022

@ndellingwood Oh sweet. I am able to reproduce the error on blake using your cmake script!!!

@jhux2
Member

jhux2 commented Aug 10, 2022

@bartlettroscoe Sure enough, the compile line is missing the "include" of triutils:

VERBOSE=1 make
cd /ascldap/users/jhu/trilinos/build-issue-10842/packages/thyra/adapters/epetra/test/EpetraOperatorWrapper &&
/ascldap/users/projects/x86-64-skylake/openmpi/4.0.1/intel/19.3.199/bin/mpicxx  -I/ascldap/users/jhu/trilinos/build-issue-10842
-I/ascldap/users/jhu/trilinos/Trilinos/packages/thyra/adapters/epetra/src -I/ascldap/users/jhu/trilinos/Trilinos/packages/thyra/core/src
-I/ascldap/users/jhu/trilinos/Trilinos/packages/thyra/core/src/interfaces/operator_vector/fundamental
-I/ascldap/users/jhu/trilinos/Trilinos/packages/thyra/core/src/interfaces/operator_vector/extended
-I/ascldap/users/jhu/trilinos/Trilinos/packages/thyra/core/src/support/operator_vector/client_support
-I/ascldap/users/jhu/trilinos/Trilinos/packages/thyra/core/src/support/operator_vector/adapter_support
-I/ascldap/users/jhu/trilinos/Trilinos/packages/thyra/core/src/interfaces/operator_solve/fundamental
-I/ascldap/users/jhu/trilinos/Trilinos/packages/thyra/core/src/interfaces/operator_solve/extended
-I/ascldap/users/jhu/trilinos/Trilinos/packages/thyra/core/src/support/operator_solve/client_support
-I/ascldap/users/jhu/trilinos/Trilinos/packages/thyra/core/src/interfaces/nonlinear/model_evaluator/fundamental
-I/ascldap/users/jhu/trilinos/Trilinos/packages/thyra/core/src/support/nonlinear/model_evaluator/client_support
-I/ascldap/users/jhu/trilinos/Trilinos/packages/thyra/core/src/interfaces/nonlinear/solvers/fundamental
-I/ascldap/users/jhu/trilinos/Trilinos/packages/thyra/core/src/support/nonlinear/solvers/client_support
-I/ascldap/users/jhu/trilinos/build-issue-10842/packages/thyra/core/src
-I/ascldap/users/jhu/trilinos/Trilinos/packages/thyra/core/example/operator_vector
-I/ascldap/users/jhu/trilinos/build-issue-10842/packages/teuchos/core/src -I/ascldap/users/jhu/trilinos/Trilinos/packages/teuchos/core/src
-I/ascldap/users/jhu/trilinos/build-issue-10842/packages/kokkos/core/src -I/ascldap/users/jhu/trilinos/Trilinos/packages/kokkos/core/src
-I/ascldap/users/jhu/trilinos/build-issue-10842/packages/kokkos -I/ascldap/users/jhu/trilinos/Trilinos/packages/teuchos/parameterlist/src
-I/ascldap/users/jhu/trilinos/Trilinos/packages/teuchos/parser/src -I/ascldap/users/jhu/trilinos/Trilinos/packages/teuchos/comm/src
-I/ascldap/users/jhu/trilinos/Trilinos/packages/teuchos/numerics/src -I/ascldap/users/jhu/trilinos/Trilinos/packages/rtop/src
-I/ascldap/users/jhu/trilinos/Trilinos/packages/rtop/src/interfaces -I/ascldap/users/jhu/trilinos/Trilinos/packages/rtop/src/support
-I/ascldap/users/jhu/trilinos/Trilinos/packages/rtop/src/ops_lib -I/ascldap/users/jhu/trilinos/Trilinos/packages/rtop/src/lapack
-I/ascldap/users/jhu/trilinos/build-issue-10842/packages/rtop/src -I/ascldap/users/jhu/trilinos/build-issue-10842/packages/epetra/src
-I/ascldap/users/jhu/trilinos/Trilinos/packages/epetra/src -I/ascldap/users/jhu/trilinos/Trilinos/packages/teuchos/remainder/src
-I/ascldap/users/jhu/trilinos/build-issue-10842/packages/teuchos/remainder/src
-I/ascldap/users/jhu/trilinos/build-issue-10842/packages/teuchos/kokkoscompat/src
-I/ascldap/users/jhu/trilinos/Trilinos/packages/teuchos/kokkoscompat/src
-I/ascldap/users/jhu/trilinos/build-issue-10842/packages/teuchos/kokkoscomm/src
-I/ascldap/users/jhu/trilinos/Trilinos/packages/teuchos/kokkoscomm/src -isystem
/ascldap/users/projects/x86-64-skylake/boost/1.65.1/intel/19.3.199/include -g -xCORE-AVX512  -O3 -DNDEBUG -std=c++14 -o
CMakeFiles/ThyraEpetraAdapters_EpetraOperatorWrapper_UnitTests.dir/EpetraOperatorWrapper_UnitTests.cpp.o -c
/ascldap/users/jhu/trilinos/Trilinos/packages/thyra/adapters/epetra/test/EpetraOperatorWrapper/EpetraOperatorWrapper_UnitTests.cpp
/ascldap/users/jhu/trilinos/Trilinos/packages/thyra/adapters/epetra/test/EpetraOperatorWrapper/EpetraOperatorWrapper_UnitTests.cpp(54): catastrophic error: cannot open source file "Trilinos_Util_CrsMatrixGallery.h"
  #include "Trilinos_Util_CrsMatrixGallery.h"

TriUtils is a required test dependency of adapters/epetra.
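
A verbose compile line like the one above can be checked mechanically for a given include directory. A minimal sketch, assuming the compile command has been captured in a shell variable; the function name `has_include_dir` is hypothetical:

```shell
# Hypothetical helper: succeed if any -I argument on the compile line
# contains the given path fragment.
# Usage: has_include_dir "<compile line>" <path-fragment>
has_include_dir() {
  compile_line=$1
  fragment=$2
  # Split the command on whitespace, one token per line, then look for
  # -I tokens that mention the fragment.
  printf '%s\n' "$compile_line" | tr -s ' \t' '\n\n' | grep -q "^-I.*$fragment"
}

# Example:
#   line=$(make VERBOSE=1 2>&1 | grep 'EpetraOperatorWrapper_UnitTests.cpp')
#   has_include_dir "$line" packages/triutils/src || echo "triutils include path is missing"
```

The fragment is treated as a regular expression, so a `.` in a path matches any character; that is loose but good enough for this kind of spot check.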

One thing I notice is that other packages (aztecoo) are using this syntax:

SET(TEST_REQUIRED_DEP_PACKAGES Triutils)

whereas thyra is using

TRIBITS_PACKAGE_DEFINE_DEPENDENCIES(
  LIB_REQUIRED_PACKAGES ThyraCore Epetra
  TEST_REQUIRED_PACKAGES Triutils
  )
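
To compare how each package declares its required test dependencies, the dependency files can be searched for both spellings. A minimal sketch, assuming the declarations live in files named `Dependencies.cmake` somewhere under the `packages/` tree; the function name `list_test_deps` is hypothetical:

```shell
# Hypothetical helper: list every required-test-dependency declaration,
# in either the SET(TEST_REQUIRED_DEP_PACKAGES ...) form or the
# TEST_REQUIRED_PACKAGES keyword form.
# Usage: list_test_deps <packages-dir>
list_test_deps() {
  # /dev/null forces grep to print file names even when find passes
  # it only a single file.
  find "$1" -name Dependencies.cmake \
    -exec grep -n -E 'TEST_REQUIRED(_DEP)?_PACKAGES' /dev/null {} +
}

# Example (from the top of the Trilinos source tree):
#   list_test_deps packages | grep -i triutils
```

Each output line has the form `file:line:text`, which makes it easy to see which packages use which form.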

@bartlettroscoe bartlettroscoe changed the title Thyra: EpetraOperatorWrapper_UnitTests.cpp cannot open "Trilinos_Util_CrsMatrixGallery.h" Thyra: EpetraOperatorWrapper_UnitTests.cpp cannot open "Trilinos_Util_CrsMatrixGallery.h" blocking all PRs changing Tpetra Aug 11, 2022
@bartlettroscoe bartlettroscoe pinned this issue Aug 11, 2022
@bartlettroscoe bartlettroscoe added this to ToDo in Trilinos TriBITS Refactor via automation Aug 11, 2022
@bartlettroscoe bartlettroscoe added TriBITS Issues with the TriBITS framework itself, not usage of the TriBITS framework and removed pkg: Thyra Issues primarily dealing with the Thyra Package labels Aug 11, 2022
@github-actions github-actions bot added this to Backlog in Tpetra Aug 11, 2022
@bartlettroscoe bartlettroscoe moved this from ToDo to In Progress in Trilinos TriBITS Refactor Aug 11, 2022
Tpetra automation moved this from Backlog to Done Aug 11, 2022
Trilinos TriBITS Refactor automation moved this from In Progress to Done Aug 11, 2022
bartlettroscoe added a commit to bartlettroscoe/Trilinos that referenced this issue Aug 11, 2022
…rilinos#10842)

Mainly pulling in this TriBITS 'master' snapshot update to address trilinos#10842.
@bartlettroscoe
Member

This is NOT fixed. Reopening.

Trilinos TriBITS Refactor automation moved this from Done to ToDo Aug 11, 2022
trilinos-autotester added a commit that referenced this issue Aug 12, 2022
…run-demo

Automatically Merged using Trilinos Pull Request AutoTester
PR Title: Add Trilinos install tests, test demo app, fix cmake --install, fix PR errors (#10774, #10810, #10842)
PR Author: bartlettroscoe
@bartlettroscoe bartlettroscoe moved this from ToDo to In Review in Trilinos TriBITS Refactor Aug 12, 2022
@bartlettroscoe
Member

With the merge of PR #10813, this should be fixed.

If you look at the PR builds for PR #10775 that started running just after the merge of PR #10813, they seem to show that these errors are gone now, looking at CDash here. If you look at the history for the build rhel7_sems-gnu-8.3.0-openmpi-1.10.1-openmp_release-debug_static_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables, for example, here, you see a build error and the not-run test ThyraEpetraAdapters_EpetraOperatorWrapper_UnitTests_MPI_4 in every iteration except the last one, started 4 hours ago at 9:16 AM MDT, showing:

image

Therefore, I think this is ready to close.

But I will leave "In review" for a few days just to be sure.

@bartlettroscoe
Member

NOTE: As shown in this query over the last 2 days, all the build errors on 'vortex' caused by this problem have disappeared (except for real failures for PR #10829), including for the PRs #10808 (see this query), #10802 (see this query), and #10775 (see this query).

That is sufficient evidence to close this issue.

Trilinos TriBITS Refactor automation moved this from In Review to Done Aug 15, 2022
@csiefer2 csiefer2 unpinned this issue Aug 15, 2022