Tests Anasazi_Epetra_ModalSolversTester_MPI_4 and Anasazi_Epetra_OrthoManagerGenTester_[0,1]_MPI_4 failing in 'debug' builds on white/ride #2473
Comments
This test is showing:
It passed yesterday in the same build as shown at:
Looking at the history of this test on this build on 'white' in the query: it fails three other times on various days going back to 3/12/2018. This suggests non-deterministic behavior causing the test to randomly fail. Does this test exercise some non-deterministic behavior in Anasazi or in the underlying software being used? Could this be exposing a weakness in Trilinos software that could bite a user in a CUDA build? In any case, I think this test should be disabled for now in these CUDA debug builds so that we can promote this build.
@mhoemmen, what do you think about this? Is it okay to disable this test for now until someone can debug what is causing the non-deterministic behavior? |
@bartlettroscoe wrote:
@hkthorn may have something to say, but I think it would be best to disable the test for now, as long as we don't "forget" (e.g., as long as we open a separate issue for the failing tests). |
One option is to leave these issues open with the label "Disabled Tests" and assign them to the Product Lead for the area. Who is the Product Lead for Anasazi? Is that @srajama1? |
Anasazi is a problem child that got stuck with a (linear solvers) family where it may not belong :). Yes, I am the lead. Let us wait for what @hkthorn says. I worry this might be exposing something non-deterministic underneath. |
These randomly failing tests triggered the following CDash error email for the newly promoted build ??? this morning. Can I go ahead and disable these randomly failing tests in these builds? The tests will only be disabled for these builds and not for others where the test is passing consistently.
From: CDash [mailto:trilinos-regression@sandia.gov]
A submission to CDash for the project Trilinos has failing tests. Details on the submission can be found at https://testing.sandia.gov/cdash/buildSummary.php?buildid=3474500
Project: Trilinos
Tests failing
-CDash on testing.sandia.gov |
@bartlettroscoe Please do; thanks! |
@bartlettroscoe @srajama1 @mhoemmen Go ahead and disable the failing tests for this platform; I have seen this issue before. Thanks! |
From @hkthorn:
Okay, I will disable these failing tests. However, also note that we saw two new failing Anasazi tests for this build today, shown in the email below. Should we disable these tests as well? If not, does someone in the Linear Solvers area have time to triage these some more? We either need to fix the tests or disable them (and then leave this issue as a reminder to fix them, along with other approaches that we can consider to keep reminders of disabled tests).
|
If you look at the query: (which shows all of the failing Anasazi tests in the last two weeks that have not already been disabled (see #2455) or are not in the 'opt' builds on white/ride (see #2454)), you can see that the tests:
fail multiple times on various days in the two builds:
All three of these tests failed on multiple days in those builds. |
…ide (trilinos#2473) These tests randomly fail with massive diffs. Very strange behavior. See Trilinos GitHub issue trilinos#2473 for history and more details.
Just realized that the @trilinos/framework team ran into these same randomly failing tests in #1393 and they resolved the issue by disabling those tests as well. So it looks like this is the right decision to disable these tests in the ATDM builds. But it also suggests that perhaps the problems with these tests should be studied more carefully, or these tests just need to be disabled altogether. That way, other people and projects will not run into these randomly failing tests over and over again. And if these are the only real tests for the ModalSolvers in Anasazi, then perhaps that feature is not ready to be used by people and should be disabled by default as experimental code or something. Then we could set up some build of Trilinos for all of this "Experimental" code so at least we know how it is doing. |
…ing-anasazi-tests Disable 3 Anasazi tests that randomly fail in debug builds on white/ride (#2473)
…evelop * 'develop' of https://github.com/trilinos/Trilinos: (560 commits)
Disabling Stefan Boltzmann tests 1 and 2 due to an unresolved hang. Also, resetting the default problem size for Helmholtz to 16x16.
Disable 3 Anasazi tests that randomly fail in debug builds on white/ride (trilinos#2473)
TrilinosCouplings: Output iteration count
Tpetra: use KokkosKernels addition (trilinos#2444)
Tank solve and value correspond for all parameters
TrilinosCouplings: OK. Now compiling
TrilinosCouplings: More RTC updates
Disabling failing test.
Stokhos: Allow mean-based PCE preconditioner with double scalar type.
(Painstakingly) reimplemented every tank equation individually. Now have solve and value working correctly together.
TrilinosCouplings: Turning off file default output
Kokkos: fix compilation for GCC 4.8.4
TrilinosCouplings: Adding block / RTC materials support to Tpetra example (take2)
Kokkos: disable failing CUDA+DEBUG test
TrilinosCouplings: Adding block / RTC materials support to Tpetra example
adding doxygen for nd method
Added comment
Fixed warnings.
Panzer: fix race condition in unit test exodus writer for CurlLaplacian example
Fixed some problems in tank example. Solve and value are at least consistent when theta=1
...
@bartlettroscoe @mhoemmen @srajama1 I have found the underlying issue in these tests. They use a Teuchos::SerialDenseMatrix, which is a serial object with no MPI communication or implied synchronization of values. These matrices are randomized on each processor and then used to perform tests of the orthogonalization routines and modal solvers. Again, there is no explicit synchronization of Teuchos SDM objects, so when the randomization generates different matrices on different processors, the tests fail because the explicit expectations of the classes being tested, orthogonalization and modal solvers, are violated. I have a feeling this pattern might be in Belos as well. I will fix this today. |
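To make the failure mode concrete, here is a minimal sketch (the matrix size, the surrounding main(), and the broadcast-based sync are illustrative assumptions, not the actual Anasazi test code) of how an unsynchronized Teuchos::SerialDenseMatrix ends up with different values on different MPI ranks, and how a broadcast restores the replicated-data assumption:

```cpp
#include "Teuchos_SerialDenseMatrix.hpp"
#include "Teuchos_CommHelpers.hpp"
#include "Teuchos_DefaultComm.hpp"
#include "Teuchos_GlobalMPISession.hpp"
#include "Teuchos_RCP.hpp"

int main(int argc, char* argv[]) {
  Teuchos::GlobalMPISession mpiSession(&argc, &argv);
  Teuchos::RCP<const Teuchos::Comm<int> > comm =
      Teuchos::DefaultComm<int>::getComm();

  // A SerialDenseMatrix is a purely local (per-process) object; random()
  // draws from each rank's own RNG state, so the entries differ across ranks.
  Teuchos::SerialDenseMatrix<int, double> B(10, 5);
  B.random();

  // At this point, an orthogonalization or modal-solver test that implicitly
  // treats B as replicated data will compute different results on different
  // ranks, producing the "random" diffs seen in the failing tests.

  // One way to restore the replicated-data assumption: broadcast rank 0's
  // entries to all ranks (column-major storage of length stride * numCols).
  Teuchos::broadcast<int, double>(*comm, 0, B.stride() * B.numCols(),
                                  B.values());

  return 0;
}
```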
@hkthorn Wow! Thanks for finding this; sounds tricky! |
@hkthorn, so this is a defect in the tests, not in the library code that users depend on? Let me know when you have merged the fix into the Trilinos 'develop' branch, and then I will re-enable these tests and we will let them run in the ATDM builds of Trilinos. |
@bartlettroscoe @mhoemmen Absolutely, this is a defect in the design of the tests. I will let you know when the fix is in the Trilinos 'develop' branch so we can re-enable the tests for the ATDM builds. |
The longstanding test failures for the ModalSolvers and OrthoManager have been tracked down to the randomization of Teuchos::SerialDenseMatrix objects in parallel. There is no expectation that calling random() on an object that is locally owned to one MPI process will result in a SerialDenseMatrix that has the SAME random numbers in it on every MPI processor. It's that easy. #2473
It looks like the test which showed:
Looking at the query: it looks like this test also failed on 'ride' in the same build on 4/3/2018 with the output:
We can keep an eye out to see if this test fails again in this build or some other build. If it does, we should likely disable this test for now. |
I'll give the test a look to see if there are any bad patterns there. I have merged the PR that fixes the testing for the OrthoManager and ModalSolvers: Thanks! |
…able-random-failing-anasazi-tests" (trilinos#2473) This reverts commit 2e9da0c, reversing changes made to c828f5a. The merge branch in PR trilinos#2517 should allow these tests to pass now.
PR #2621, which re-enables these tests, has been merged. Now we wait and see how they run and whether they fail in the coming days and weeks. I am removing the "Disabled Tests" label. |
NOTE: All that is left for this current issue is to watch and see if we see any more random failures with these tests. |
Looking at the recent history for these tests on CDash after 5/19/2018 (when the NETLIB BLAS and LAPACK got put back as described in #2454 (comment)) in the following queries:
We can see these tests did not fail a single time, and they show as running in the expected builds. Therefore, this issue appears to be resolved. Closing as complete. |
While addressing issue trilinos#2473, I found other places where a random serial dense matrix was used and expected to be the same in parallel. The synchronization method that was used to address the issue in Anasazi has been moved to the Teuchos serial dense helpers file so that other packages can use this utility in the generation of tests. In particular, this utility needed to be integrated into the MVOP testers for Belos and Anasazi, as well as the Belos orthogonalization tester. The assumption that a call to generate a random variable will return the same value on all processors is false and could have unknown consequences for testing. While it is unknown if any random failures can be tracked to these changes at this time, previous issues with Anasazi have been caused by this bad assumption. So, it is better to fix it.
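As a rough illustration of the kind of reusable utility described above, here is a hedged sketch; the function name randomSyncedMatrixSketch and its interface are assumptions for illustration, not the actual helper in the Teuchos serial dense helpers file:

```cpp
#include "Teuchos_SerialDenseMatrix.hpp"
#include "Teuchos_CommHelpers.hpp"
#include "Teuchos_DefaultComm.hpp"
#include "Teuchos_RCP.hpp"

// Hypothetical helper: fill a locally owned SerialDenseMatrix with random
// values, then make every MPI rank agree on those values by broadcasting
// rank 0's entries.  (The real Teuchos helper may differ in name and details.)
template <typename OrdinalType, typename ScalarType>
void randomSyncedMatrixSketch(
    Teuchos::SerialDenseMatrix<OrdinalType, ScalarType>& A) {
  // Each rank fills A with its own random values ...
  A.random();

  // ... then rank 0's entries overwrite everyone else's, so the
  // "same matrix on every processor" assumption made by the MVOP and
  // orthogonalization testers actually holds.
  Teuchos::RCP<const Teuchos::Comm<OrdinalType> > comm =
      Teuchos::DefaultComm<OrdinalType>::getComm();
  Teuchos::broadcast<OrdinalType, ScalarType>(
      *comm, 0, A.stride() * A.numCols(), A.values());
}
```

A tester would then call such a helper wherever it previously called random() directly on a locally owned matrix, so that every rank works with the same coefficients.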
CC: @trilinos/anasazi, @mhoemmen
Next Action Status
PR #2621, which re-enables the tests Anasazi_Epetra_ModalSolversTester_MPI_4 and Anasazi_Epetra_OrthoManagerGenTester_[0,1]_MPI_4, was merged on 4/24/2018. The tests ran and passed in all promoted ATDM Trilinos builds between 5/20/2018 and 6/7/2018.
Description
The tests:
Anasazi_Epetra_ModalSolversTester_MPI_4
Anasazi_Epetra_OrthoManagerGenTester_0_MPI_4
Anasazi_Epetra_OrthoManagerGenTester_1_MPI_4
failed in the Trilinos-atdm-hansen-shiller-cuda-debug build on 'ride' as shown at:
This build is targeted to be an auto PR build for Trilinos (see #2464), so we desire to clean up this build more quickly.
Interestingly, these tests did not fail in what should be the identical Trilinos-atdm-hansen-shiller-cuda-debug build on the identical machine 'white' as shown at:
Strangely, those tests did fail in the Trilinos-atdm-hansen-shiller-cuda-debug build on 'white' yesterday, shown at:
A) Anasazi_Epetra_ModalSolversTester_MPI_4:
The failing test Anasazi_Epetra_ModalSolversTester_MPI_4 today, with details shown at:
showed the failure:
Looking at all of the builds today that ran that test, shown at:
this test fails in the same way (i.e., a numerical problem) in the builds Linux-gcc-4.8.4-MPI_RELEASE_12.12.1 and Linux-gcc-4.8.4-MPI_RELEASE_12.12.1_SHARED on the machine hansel.sandia.gov, so this problem is not isolated to ATDM builds of Trilinos.
Also note that this test failed for the ATDM builds Trilinos-atdm-white-ride-gnu-opt-openmp and Trilinos-atdm-white-ride-gnu-opt-openmp with segfaults, but that is already being addressed by #2454 and is likely unrelated.
B) Anasazi_Epetra_OrthoManagerGenTester_0_MPI_4:
The failing test Anasazi_Epetra_OrthoManagerGenTester_0_MPI_4 today, with details shown at:
showed:
Looking at all of the builds today that ran that test, shown at:
you can see that this test also failed in a similar (numerical) way in the builds Linux-gcc-4.9.3-Sierra_MPI_release_DEV_ETI_SERIAL-ON_OPENMP-ON_PTHREAD-OFF_CUDA-OFF_COMPLEX-ON and Linux-GCC-4.9.3-openmpi-1.8.7_Debug_DEV_Werror, so it looks like this problem is not isolated to ATDM builds of Trilinos. Note that one of those is a 'Sierra' build of Trilinos.