Test Ifpack2_unit_tests_MPI_4 unit tests randomly failing in many ATDM and PR builds since at least 2021-08-30 #10016

Closed
bartlettroscoe opened this issue Dec 14, 2021 · 6 comments
Assignees
Labels
impacting: tests The defect (bug) is primarily a test failure (vs. a build failure) PA: Linear Solvers Issues that fall under the Trilinos Linear Solvers Product Area pkg: Ifpack2 Primary Build Added by triager to mark failures affecting primary builds Secondary Build Added by triager to mark failures affecting secondary builds type: bug The primary issue is a bug in Trilinos code or tests

Comments

@bartlettroscoe
Member

bartlettroscoe commented Dec 14, 2021

CC: @trilinos/ifpack2, @<triage-contact> (Trilinos <product-area-name> Triage Contact (or "Current ATDM contact"))

Next Action Status

Description

As shown in this query (click "Show Matching Output" in the upper right), the tests:

  • Ifpack2_unit_tests_MPI_4

in the builds:

  • PR-9483-test-Trilinos_pullrequest_clang_10.0.0-3559
  • PR-9483-test-Trilinos_pullrequest_gcc_7.2.0_debug-3527
  • PR-9483-test-Trilinos_pullrequest_gcc_7.2.0_debug-3591
  • PR-9627-test-Trilinos_pullrequest_cuda_10.1.105-2132
  • PR-9627-test-Trilinos_pullrequest_cuda_10.1.105_uvm_off-1129
  • PR-9660-test-Trilinos_pullrequest_gcc_7.2.0_debug-3528
  • PR-9660-test-Trilinos_pullrequest_gcc_7.2.0_debug-3538
  • PR-9676-test-Trilinos_pullrequest_clang_10.0.0-3585
  • PR-9691-test-Trilinos_pullrequest_clang_10.0.0-3641
  • PR-9691-test-Trilinos_pullrequest_gcc_7.2.0_debug-3614
  • PR-9691-test-Trilinos_pullrequest_gcc_7.2.0_debug-3648
  • PR-9758-test-Trilinos_pullrequest_gcc_7.2.0_debug-3747
  • PR-9768-test-Trilinos_pullrequest_clang_10.0.0-3765
  • PR-9773-test-rhel7_sems-clang-7.0.1-openmpi-1.10.1-serial_release-debug_shared_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-package-enables-19
  • PR-9810-test-Trilinos_pullrequest_gcc_7.2.0_debug-3839
  • PR-9836-test-Trilinos_pullrequest_clang_10.0.0-3913
  • PR-9859-test-rhel7_sems-clang-7.0.1-openmpi-1.10.1-serial_release-debug_shared_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-package-enables-49
  • PR-9859-test-rhel7_sems-clang-7.0.1-openmpi-1.10.1-serial_release-debug_shared_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-package-enables-53
  • PR-9866-test-rhel7_sems-gnu-7.2.0-openmpi-1.10.1-serial_debug_shared_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-package-enables-77
  • PR-9876-test-rhel7_sems-gnu-7.2.0-openmpi-1.10.1-serial_debug_shared_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-package-enables-81
  • PR-9876-test-rhel7_sems-gnu-7.2.0-openmpi-1.10.1-serial_release-debug_shared_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-package-enables-142
  • PR-9876-test-rhel7_sems-gnu-8.3.0-openmpi-1.10.1-openmp_release-debug_static_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-package-enables-188
  • PR-9883-test-Trilinos_pullrequest_clang_10.0.0-3937
  • PR-9920-test-Trilinos_pullrequest_gcc_7.2.0_debug-4045
  • PR-9929-test-Trilinos_pullrequest_gcc_7.2.0_debug-4120
  • PR-9990-test-rhel7_sems-gnu-7.2.0-openmpi-1.10.1-serial_release-debug_shared_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-package-enables-202
  • PR-9999-test-Trilinos_pullrequest_clang_10.0.0-4135
  • PR-Experimental-test-Trilinos_pullrequest_caraway-29
  • Trilinos-atdm-sems-rhel7-clang-7.0.1-openmp-shared-release
  • Trilinos-atdm-sems-rhel7-clang-7.0.1-openmp-shared-release-debug
  • Trilinos-atdm-sems-rhel7-intel-18.0.5-openmp-shared-debug
  • Trilinos-atdm-sems-rhel7-intel-18.0.5-openmp-shared-release-debug

started failing on testing day 2021-08-30.

When the unit test Ifpack2Chebyshev_double_int_longlong_Test0_UnitTest fails, it seems to miss the tolerance by only a small margin, as shown here:

p=3 | 17. Ifpack2Chebyshev_double_int_longlong_Test0_UnitTest ... 
p=3 |  Ifpack2::Version(): Ifpack2 VOTD
p=3 |  Test that code {prec.setParameters(params);} does not throw : passed
p=3 |  prec_dom_map_ptr = Teuchos::RCP<Tpetra::Map<int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::OpenMP, Kokkos::HostSpace> > const>{ptr=0x1febde0,node=0x2005940,strong_count=11,weak_count=0} == mtx_dom_map_ptr = Teuchos::RCP<Tpetra::Map<int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::OpenMP, Kokkos::HostSpace> > const>{ptr=0x1febde0,node=0x2005940,strong_count=11,weak_count=0} : passed
p=3 |  prec_rng_map_ptr = Teuchos::RCP<Tpetra::Map<int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::OpenMP, Kokkos::HostSpace> > const>{ptr=0x1febde0,node=0x2005940,strong_count=11,weak_count=0} == mtx_rng_map_ptr = Teuchos::RCP<Tpetra::Map<int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::OpenMP, Kokkos::HostSpace> > const>{ptr=0x1febde0,node=0x2005940,strong_count=11,weak_count=0} : passed
p=3 |  Comparing yview == twos() ... passed
p=3 |  Comparing yview == halfs() ... passed
p=3 |  Comparing yview == halfs() ... passed
p=3 |  
p=3 |  Check: rel_err(prec.getLambdaMaxForApply(), expectedLambdaMax)
p=3 |         = rel_err(1.93166, 1.98883) = 0.0287438
p=3 |           <= tol = 0.02 : FAILED
p=3 |  
p=3 |  Check: rel_err(prec.getLambdaMaxForApply(), expectedLambdaMax)
p=3 |         = rel_err(1.95856, 1.98883) = 0.0152212
p=3 |           <= tol = 0.02 : passed
p=3 |  NOTE: Unit test failed on all processes!
p=3 |  [FAILED]  (0.00334 sec) Ifpack2Chebyshev_double_int_longlong_Test0_UnitTest
p=3 |  Location: /scratch/jenkins/ascicgpu14/workspace/Trilinos-atdm-sems-rhel7-clang-7.0.1-openmp-shared-release/SRC_AND_BUILD/Trilinos/packages/ifpack2/test/unit_tests/Ifpack2_UnitTestChebyshev.cpp:65
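
The failing check in the log above can be reproduced by hand. This is a minimal sketch, assuming `rel_err` is the absolute difference divided by the larger of the two magnitudes (an assumption about the test harness, not something confirmed in this issue). The helper name is only illustrative, and the inputs are the rounded values printed in the log, so the results agree with the logged ones only to about four digits.

```python
def rel_err(a, b):
    """Relative error, ASSUMED here to be |a - b| / max(|a|, |b|)."""
    scale = max(abs(a), abs(b))
    return abs(a - b) / scale if scale > 0.0 else 0.0


expected_lambda_max = 1.98883  # value printed in the log
tol = 0.02                     # tolerance printed in the log

# The two computed values printed in the log above.
for computed in (1.93166, 1.95856):
    err = rel_err(computed, expected_lambda_max)
    status = "passed" if err <= tol else "FAILED"
    print(f"rel_err({computed}, {expected_lambda_max}) = {err:.4f} "
          f"<= tol = {tol} : {status}")
```

Under that assumption the first value misses the 0.02 tolerance by less than 0.01, while the second is inside it.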

It looks like other unit tests are also randomly failing to meet their tolerances.

If you run this query and then click "Show Matching Output", you can see by how much the tolerance is being missed in these various tests.

Current Status on CDash

Run the above query, adjusting the "Begin" and "End" dates to match today or any other date range, or just click "CURRENT" in the top bar to see results for the current testing day.

Steps to Reproduce

One should be able to reproduce this failure as described in:

and the system-specific instructions at:

Just log into any of the associated machines and copy and paste the full CDash build name <build-name> listed above and run commands like:

$ cd <some_build_dir>/

$ source $TRILINOS_DIR/cmake/std/atdm/load-env.sh <build-name>

$ cmake \
 -GNinja \
 -DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
 -DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_<package-name>=ON \
 $TRILINOS_DIR

$ make NP=16

$ <command-to-run-on-compute-node> ctest -j4

where <package-name> is any package that you want to enable to reproduce build and/or test results.

Again, for exact system-specific details on what commands to run to build and run tests, see:

If you can't figure out what commands to run to reproduce the problem given this documentation, then please post a comment here and we will give you the exact minimal commands.

@bartlettroscoe bartlettroscoe added type: bug The primary issue is a bug in Trilinos code or tests pkg: Ifpack2 impacting: tests The defect (bug) is primarily a test failure (vs. a build failure) PA: Linear Solvers Issues that fall under the Trilinos Linear Solvers Product Area Primary Build Added by triager to mark failures affecting primary builds Secondary Build Added by triager to mark failures affecting secondary builds labels Dec 14, 2021
@bartlettroscoe bartlettroscoe added this to Backlog in Build Triage via automation Dec 14, 2021
@bartlettroscoe
Member Author

@trilinos/framework, note, I did not add the entries to the TrilinosATDMStatus/*.csv files for these tests. I think these failures need some more triaging.

Note that this has only caused 95 test failures over 32 different builds since 2021-08-30, as shown by the output of the triaging script create_trilinos_github_test_failure_issue_driver.sh:

Total number of nonpassing tests over all days = 95

Total number of unique nonpassing test/build pairs over all days = 32

Number of test names = 1

Number of build names = 32

Note that this caused at least 28 PR build iterations to fail, so this should be triaged and fixed.
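
For readers unfamiliar with the four totals above, here is a rough sketch of how such counts could be derived from a list of non-passing test results. The `summarize` function and the record layout are hypothetical and are not taken from the triaging script. The build and test names come from this issue, but the sample records, including most of the dates, are made up for illustration and are not real CDash data.

```python
def summarize(records):
    """Count non-passing results; each record is (day, build_name, test_name)."""
    pairs = {(build, test) for _, build, test in records}
    return {
        "nonpassing_tests": len(records),
        "unique_test_build_pairs": len(pairs),
        "test_names": len({test for _, _, test in records}),
        "build_names": len({build for _, build, _ in records}),
    }


TEST = "Ifpack2_unit_tests_MPI_4"

# HYPOTHETICAL sample records, not actual CDash results.
records = [
    ("2021-08-30", "PR-9627-test-Trilinos_pullrequest_cuda_10.1.105-2132", TEST),
    ("2021-08-30", "Trilinos-atdm-sems-rhel7-clang-7.0.1-openmp-shared-release", TEST),
    ("2021-09-02", "Trilinos-atdm-sems-rhel7-clang-7.0.1-openmp-shared-release", TEST),
    ("2021-09-05", "Trilinos-atdm-sems-rhel7-intel-18.0.5-openmp-shared-debug", TEST),
]

for key, value in summarize(records).items():
    print(f"{key} = {value}")
```

A build that fails on several days adds to the total failure count but only once to the unique test/build pairs, which is why 95 failures can map to only 32 pairs.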

I noticed this while trying to find a clean version of Trilinos to do testing with for PR #9973 and TriBITSPub/TriBITS#433.

@cgcgcg cgcgcg self-assigned this Dec 14, 2021
@bartlettroscoe
Member Author

NOTE: If you run this query and click "Show Matching Output", you can see by how much the tolerance is being missed in all of these test runs on one page. So either the solve tolerance needs to be tightened or the checking tolerance needs to be loosened.
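
To illustrate why a run-to-run varying estimate can land on either side of a fixed checking tolerance, here is a sketch under stated assumptions. It assumes the maximum eigenvalue is estimated with a few power-iteration steps from a random start vector on a diagonally scaled 1D Laplacian with 20 rows. None of this is confirmed by this issue or taken from the Ifpack2 source. The exact largest eigenvalue of that model matrix, 1 + cos(pi/21), does round to the logged expectedLambdaMax of 1.98883. All function names are hypothetical.

```python
import math
import random


def apply_scaled_laplacian(x):
    """Return D^{-1} A x for A = tridiag(-1, 2, -1) and D = diag(A) = 2 I."""
    n = len(x)
    y = []
    for i in range(n):
        left = x[i - 1] if i > 0 else 0.0
        right = x[i + 1] if i < n - 1 else 0.0
        y.append(x[i] - 0.5 * (left + right))
    return y


def estimate_lambda_max(n, num_iters, seed):
    """Rayleigh-quotient estimate after num_iters power-iteration steps."""
    rng = random.Random(seed)
    x = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    for _ in range(num_iters):
        y = apply_scaled_laplacian(x)
        norm = math.sqrt(sum(v * v for v in y))
        x = [v / norm for v in y]
    y = apply_scaled_laplacian(x)
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)


n = 20
exact = 1.0 + math.cos(math.pi / (n + 1))  # largest eigenvalue of D^{-1} A

# Different start vectors (seeds) give different estimates for the same matrix.
for seed in range(4):
    est = estimate_lambda_max(n, num_iters=10, seed=seed)
    print(f"seed={seed}: estimate={est:.5f}, "
          f"rel_err={abs(est - exact) / exact:.5f}")
```

In this model the estimate never exceeds the exact value and improves with more iterations, so both remedies named above apply: more iterations (a tighter solve) shrink the error, and a looser checking tolerance accepts the remaining spread.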

@bartlettroscoe
Member Author

Also note that another unit test with a tolerance of 0.01 failed as well, as shown in this query:

Site Build Name Test Name Status Time Proc Time Details Build Time Processors Matching Output
ride19 PR-9627-test-Trilinos_pullrequest_cuda_10.1.105_uvm_off-1129 Ifpack2_unit_tests_MPI_4 Failed 3s 410ms 13s 640ms Completed (Failed) 2021-08-30T21:31:34 MDT 4 xpectedLambdaMax) p=3 | = rel_err(1.96503, 1.98883) = 0.0119686 p=3 | <= tol = 0.01 : FAILED p=3 | p=3 | Check: rel_err(prec.getLambdaMaxForApply(), expectedLambdaMax) p=3 |
ride18 PR-9627-test-Trilinos_pullrequest_cuda_10.1.105-2132 Ifpack2_unit_tests_MPI_4 Failed 3s 870ms 15s 480ms Completed (Failed) 2021-08-30T21:31:26 MDT 4 xpectedLambdaMax) p=2 | = rel_err(1.96503, 1.98883) = 0.0119686 p=2 | <= tol = 0.01 : FAILED p=2 | p=2 | Check: rel_err(prec.getLambdaMaxForApply(), expectedLambdaMax) p=2 |

But those only occurred on 2021-08-30 and not since, so I think you can ignore those.

@bartlettroscoe
Member Author

@trilinos/framework

NOTE: Even though there were 95 failures of this test across many PR iterations and several ATDM Trilinos builds over 3+ months, no one reported this (until I did, and I don't count). This shows a gap in the current Trilinos testing and triaging efforts: such errors are not caught and reported sooner.

This suggests the need for another screening tool or process that looks at even a single failing test in a PR or nightly build and then constructs CDash queries that look over all builds and over a longer period of time to see if there is a pattern. (This is what I did manually in this case.) In this case, I saw a tolerance that was missed by a small amount and I figured that this was not the first time such a failure was impacting the automated builds. (When you see a tolerance missed by a huge margin, that is more likely to be a serious bug or system issue and not just non-determinism causing significant differences in roundoff errors.)
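
One piece of such a screening step could be sketched as follows: parse the tolerance checks out of the test output and separate small misses from large ones. This is only an illustration. The `classify_checks` function, the regular expression, and the `near_miss_factor` threshold of 2.0 are all hypothetical choices, and the sketch works on plain output text in the format shown earlier in this issue rather than on any CDash API.

```python
import re

# Matches "= rel_err(a, b) = err" followed by "<= tol = t : FAILED|passed".
CHECK_RE = re.compile(
    r"=\s*rel_err\(\s*([-+0-9.eE]+)\s*,\s*([-+0-9.eE]+)\s*\)\s*=\s*([-+0-9.eE]+)"
    r".*?<=\s*tol\s*=\s*([-+0-9.eE]+)\s*:\s*(FAILED|passed)",
    re.DOTALL,
)


def classify_checks(output, near_miss_factor=2.0):
    """Label each check as 'passed', 'near miss', or 'large miss'."""
    results = []
    for m in CHECK_RE.finditer(output):
        err, tol, status = float(m.group(3)), float(m.group(4)), m.group(5)
        if status == "passed":
            label = "passed"
        elif err <= near_miss_factor * tol:
            label = "near miss"   # likely candidate for a flaky tolerance
        else:
            label = "large miss"  # more likely a real bug or system issue
        results.append((err, tol, label))
    return results


# Excerpt of the test output quoted earlier in this issue.
sample = """\
p=3 |  Check: rel_err(prec.getLambdaMaxForApply(), expectedLambdaMax)
p=3 |         = rel_err(1.93166, 1.98883) = 0.0287438
p=3 |           <= tol = 0.02 : FAILED
p=3 |
p=3 |  Check: rel_err(prec.getLambdaMaxForApply(), expectedLambdaMax)
p=3 |         = rel_err(1.95856, 1.98883) = 0.0152212
p=3 |           <= tol = 0.02 : passed
"""

for err, tol, label in classify_checks(sample):
    print(f"rel_err={err} tol={tol}: {label}")
```

A near miss found this way would then be the trigger for the wider query over all builds and a longer date range.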

@e10harvey
Contributor

e10harvey commented Dec 14, 2021

@bartlettroscoe: Thanks for reporting this. This test failure is showing up in our weekly SecondaryATDM triaging monitor.

CC: @jwillenbring, @ZUUL42

@cgcgcg
Contributor

cgcgcg commented Jan 28, 2022

Fixed by PR #10017

@cgcgcg cgcgcg closed this as completed Jan 28, 2022
Build Triage automation moved this from Backlog to Done Jan 28, 2022