Tests Anasazi_Epetra_ModalSolversTester_MPI_4 and Anasazi_Epetra_OrthoManagerGenTester_[0,1]_MPI_4 failing in 'debug' builds on white/ride #2473

Closed
bartlettroscoe opened this issue Mar 28, 2018 · 21 comments
Assignees
Labels
client: ATDM Any issue primarily impacting the ATDM project PA: Linear Solvers Issues that fall under the Trilinos Linear Solvers Product Area pkg: Anasazi stage: in review Primary work is completed and now is just waiting for human review and/or test feedback type: bug The primary issue is a bug in Trilinos code or tests

Comments

@bartlettroscoe
Member

bartlettroscoe commented Mar 28, 2018

CC: @trilinos/anasazi, @mhoemmen

Next Action Status

PR #2621, which re-enables the tests Anasazi_Epetra_ModalSolversTester_MPI_4 and Anasazi_Epetra_OrthoManagerGenTester_[0,1]_MPI_4, was merged on 4/24/2018. The tests ran and passed in all promoted ATDM Trilinos builds between 5/20/2018 and 6/7/2018.

Description

The tests:

  • Anasazi_Epetra_ModalSolversTester_MPI_4
  • Anasazi_Epetra_OrthoManagerGenTester_0_MPI_4
  • Anasazi_Epetra_OrthoManagerGenTester_1_MPI_4

failed in Trilinos-atdm-hansen-shiller-cuda-debug build on 'ride' as shown at:

This build is targeted to be an auto PR build for Trilinos (see #2464) so we desire to clean up this build more quickly.

Interestingly, these tests did not fail in what should be the identical Trilinos-atdm-hansen-shiller-cuda-debug build on the identical machine 'white' as shown at:

Strangely, those tests did fail in the Trilinos-atdm-hansen-shiller-cuda-debug build on 'white' yesterday, as shown at:

A) Anasazi_Epetra_ModalSolversTester_MPI_4:

The failing test Anasazi_Epetra_ModalSolversTester_MPI_4 today with details shown at:

showed the failure:

************* Householder Apply Test *************

             orthonorm error of V: 7.08978e-16
            orthonorm error of VQ: 0.375867
ERROR:  V*Q failed.
    orthonorm error of applyHouse: 0.375867
ERROR:  applyHouse failed.
        error(VQ - house(V,H,tau): 2.64481e-16

************* DirectSolver Test *************

Looking at all of the builds today that ran that test shown at:

this test fails in the same way (i.e. a numerical problem) on the builds Linux-gcc-4.8.4-MPI_RELEASE_12.12.1 and Linux-gcc-4.8.4-MPI_RELEASE_12.12.1_SHARED on the machine hansel.sandia.gov so this problem is not isolated to ATDM builds of Trilinos.

Also note that this test failed for the ATDM builds Trilinos-atdm-white-ride-gnu-opt-openmp and Trilinos-atdm-white-ride-gnu-opt-openmp with segfaults, but that is already being addressed by #2454 and is likely unrelated.
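
For readers unfamiliar with the metric: the "orthonorm error" values in the output above presumably measure how far the columns of a multivector are from orthonormal, i.e. some norm of V^T V - I. A minimal sketch of such a measure follows, assuming a Frobenius norm and a small hypothetical `Dense` helper type (the actual Anasazi test may use a different norm and different data structures). An error near 1e-16 is rounding noise, while 0.375867 means the columns are clearly not orthonormal.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical column-major dense matrix, just enough for this sketch.
struct Dense {
  std::size_t rows, cols;
  std::vector<double> a;
  Dense(std::size_t r, std::size_t c) : rows(r), cols(c), a(r * c, 0.0) {}
  double& operator()(std::size_t i, std::size_t j) {
    assert(i < rows && j < cols);
    return a[i + j * rows];
  }
  double operator()(std::size_t i, std::size_t j) const {
    assert(i < rows && j < cols);
    return a[i + j * rows];
  }
};

// Frobenius norm of V^T V - I. This is zero (up to rounding) exactly when
// the columns of V are orthonormal.
double orthonormError(const Dense& V) {
  double sum = 0.0;
  for (std::size_t i = 0; i < V.cols; ++i) {
    for (std::size_t j = 0; j < V.cols; ++j) {
      double dot = 0.0;
      for (std::size_t k = 0; k < V.rows; ++k) dot += V(k, i) * V(k, j);
      const double diff = dot - (i == j ? 1.0 : 0.0);
      sum += diff * diff;
    }
  }
  return std::sqrt(sum);
}
```

With this definition, a matrix whose columns are two distinct unit vectors e1 and e2 gives an error of zero, while a matrix whose two columns are both e1 gives an error well above any reasonable tolerance.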

B) Anasazi_Epetra_OrthoManagerGenTester_0_MPI_4:

The failing test Anasazi_Epetra_OrthoManagerGenTester_0_MPI_4 today with details shown at:

showed:

Anasazi in Trilinos 12.13 (Dev)

 Generating Y1,Y2 for project() : testing... 
   || <Y1,Y1> - I || : 6.47718e-16
   || <Y2,Y2> - I || : 7.20309e-16
   || <X1,Y2> ||     : 1.64775e-16
   || <X1b,Y2> ||     : 6.9984e-15

p=3: *** Caught standard std::exception of type 'std::runtime_error' :

 /home/jenkins/ride/workspace/Trilinos-atdm-white-ride-cuda-debug/SRC_AND_BUILD/Trilinos/packages/anasazi/epetra/test/OrthoManager/cxx_gentest.cpp:274:
 
 Throw number = 1
 
 Throw test that evaluated to true: err > TOL
 
 New X1 did not meet tolerance: orthog(X1,Y2) == 0.547032

Looking at all of the builds today that ran that test shown at:

you can see that this test also failed in a similar (numerical) way in the builds Linux-gcc-4.9.3-Sierra_MPI_release_DEV_ETI_SERIAL-ON_OPENMP-ON_PTHREAD-OFF_CUDA-OFF_COMPLEX-ON and Linux-GCC-4.9.3-openmpi-1.8.7_Debug_DEV_Werror so it looks like this problem is not isolated to ATDM builds of Trilinos. Note that one of those is a 'Sierra' build of Trilinos.

@bartlettroscoe bartlettroscoe added type: bug The primary issue is a bug in Trilinos code or tests pkg: Anasazi client: ATDM Any issue primarily impacting the ATDM project labels Mar 28, 2018
@bartlettroscoe
Member Author

This test Anasazi_Epetra_OrthoManagerGenTester_0_MPI_4 newly failed in the build Trilinos-atdm-white-ride-cuda-debug on 'white' today as shown at:

showing:

Generating Y1,Y2 for project() : testing... 
   || <Y1,Y1> - I || : 7.13673e-16
   || <Y2,Y2> - I || : 7.85286e-16
   || <X1,Y2> ||     : 1.71386e-16
   || <X1b,Y2> ||     : 7.10285e-15

p=1: *** Caught standard std::exception of type 'std::runtime_error' :

 /home/rabartl/WHITE/ATDM_Driver/Trilinos-atdm-white-ride-cuda-debug/SRC_AND_BUILD/Trilinos/packages/anasazi/epetra/test/OrthoManager/cxx_gentest.cpp:274:
 
 Throw number = 1
 
 Throw test that evaluated to true: err > TOL
 
 New X1 did not meet tolerance: orthog(X1,Y2) == 0.356233

...

It passed yesterday in the same build as shown at:

Looking at the history of this test on this build on 'white' in the query:

it fails three other times on various days going back to 3/12/2018. This suggests non-deterministic behavior causing the test to randomly fail.

Does this test expose some non-deterministic behavior in Anasazi or in the underlying software being used? Could this be exposing a weakness in Trilinos software that could bite a user in a CUDA build?

In any case, I think this test should be disabled for now on these CUDA debug builds so that we can promote this build Trilinos-atdm-white-ride-cuda-debug to the "ATDM" CDash Track/Group which opens the door to using it as an auto PR build for Trilinos (which will be huge for stabilizing Trilinos for ATDM customers). Then, someone can debug this test offline when they get some time.

@mhoemmen, what do you think about this? Is it okay to disable this test for now until someone can debug what is causing the non-deterministic behavior?

@mhoemmen
Contributor

@bartlettroscoe wrote:

Is it okay to disable this test for now until someone can debug what is causing the non-deterministic behavior?

@hkthorn may have something to say, but I think it would be best to disable the test for now, as long as we don't "forget" (e.g., as long as we open a separate issue for the failing tests).

@bartlettroscoe
Member Author

I think it would be best to disable the test for now, as long as we don't "forget" (e.g., as long as we open a separate issue for the failing tests).

@mhoemmen and @hkthorn,

One option is to leave these issues open with the label "Disabled Tests" and assign it to the Product Lead for the area. Who is the Product Lead for Anasazi? Is that @srajama1?

@srajama1
Contributor

Anasazi is a problem child that got stuck with a (linear solvers) family where it may not belong :). Yes, I am the lead. Let us wait for what @hkthorn says.

I worry this might be exposing something non-deterministic underneath.

@bartlettroscoe
Member Author

These randomly failing tests triggered the following CDash error email for the newly promoted build ??? this morning.

Can I go ahead and disable these randomly failing tests in these builds? The tests will only be disabled for these builds and not others where the tests are passing consistently.


From: CDash [mailto:trilinos-regression@sandia.gov]
Sent: Saturday, March 31, 2018 2:48 AM
To: Bartlett, Roscoe A rabartl@sandia.gov
Subject: FAILED (t=2): Trilinos/Anasazi - Trilinos-atdm-white-ride-gnu-debug-openmp - ATDM

A submission to CDash for the project Trilinos has failing tests.
You have been identified as one of the authors who have checked in changes that are part of this submission or you are listed in the default contact list.

Details on the submission can be found at https://testing.sandia.gov/cdash/buildSummary.php?buildid=3474500

Project: Trilinos
SubProject: Anasazi
Site: white
Build Name: Trilinos-atdm-white-ride-gnu-debug-openmp
Build Time: 2018-03-31T06:45:53 UTC
Type: ATDM
Tests failing: 2

Tests failing
Anasazi_Epetra_ModalSolversTester_MPI_4 (https://testing.sandia.gov/cdash/testDetails.php?test=46065301&build=3474500)
Anasazi_Epetra_OrthoManagerGenTester_0_MPI_4 (https://testing.sandia.gov/cdash/testDetails.php?test=46065302&build=3474500)

-CDash on testing.sandia.gov

@mhoemmen
Contributor

mhoemmen commented Apr 1, 2018

@bartlettroscoe Please do; thanks!

@hkthorn
Contributor

hkthorn commented Apr 2, 2018

@bartlettroscoe @srajama1 @mhoemmen Go ahead and disable the failing tests for this platform, I have seen this issue before. Thanks!

@bartlettroscoe
Member Author

From @hkthorn:

Go ahead and disable the failing tests for this platform, I have seen this issue before. Thanks!

Okay, I will disable these failing tests. However, also note that we saw two new failing Anasazi tests for this build today shown in the below email.

The first test Anasazi_Epetra_OrthoManagerGenTester_0_MPI_4 was a segfault. The last two look to be diffs.

Should we disable these tests as well? If not, does someone in the Linear Solvers area have time to triage these some more? We either need to fix the tests or disable them (and then leave this issue as a reminder to fix them along with other approaches that we can consider to keep reminders of disabled tests).


From: CDash [mailto:trilinos-regression@sandia.gov]
Sent: Tuesday, April 03, 2018 1:32 AM
To: Bartlett, Roscoe A
Subject: FAILED (t=3): Trilinos/Anasazi - Trilinos-atdm-white-ride-gnu-debug-openmp - ATDM

A submission to CDash for the project Trilinos has failing tests.
You have been identified as one of the authors who have checked in changes
that are part of this submission or you are listed in the default contact list.

Details on the submission can be found at
https://testing.sandia.gov/cdash/buildSummary.php?buildid=3480083

Project: Trilinos
SubProject: Anasazi
Site: ride
Build Name: Trilinos-atdm-white-ride-gnu-debug-openmp
Build Time: 2018-04-03T07:30:22 UTC
Type: ATDM
Tests failing: 3

Tests failing
Anasazi_Epetra_OrthoManagerGenTester_0_MPI_4
(https://testing.sandia.gov/cdash/testDetails.php?test=46173794&build=3480083)
Anasazi_Epetra_OrthoManagerGenTester_1_MPI_4
(https://testing.sandia.gov/cdash/testDetails.php?test=46173795&build=3480083)
Anasazi_Epetra_LOBPCG_solvertest_MPI_4
(https://testing.sandia.gov/cdash/testDetails.php?test=46173813&build=3480083)

-CDash on testing.sandia.gov

@bartlettroscoe
Member Author

If you look at the query:

(which shows all of the failing Anasazi tests in the last two weeks that have not already been disabled (see #2455) or are not in the 'opt' builds on white/ride (see #2454)), you can see that the tests:

  • Anasazi_Epetra_ModalSolversTester_MPI_4
  • Anasazi_Epetra_OrthoManagerGenTester_0_MPI_4
  • Anasazi_Epetra_OrthoManagerGenTester_1_MPI_4

fail multiple times on various days in the two builds:

  • Trilinos-atdm-white-ride-cuda-debug
  • Trilinos-atdm-white-ride-gnu-debug-openmp

All three of these tests failed multiple days in the Trilinos-atdm-white-ride-cuda-debug build which is being targeted for an auto PR testing build (see #2464). Therefore, these should be disabled (as @hkthorn noted above).

The test Anasazi_Epetra_LOBPCG_solvertest_MPI_4 only failed today in the build Trilinos-atdm-white-ride-gnu-debug-openmp as shown in the above query. Therefore, this might have been a fluke so we should not disable this yet.

bartlettroscoe added a commit to bartlettroscoe/Trilinos that referenced this issue Apr 3, 2018
…ide (trilinos#2473)

These tests randomly fail with massive diffs.  Very strange behavior.  See
Trilinos GitHub issue trilinos#2473 for history and more details.
@bartlettroscoe
Member Author

FYI: I created PR #2501 to disable these three randomly failing tests. I requested a review from @mhoemmen and/or @hkthorn.

@bartlettroscoe bartlettroscoe changed the title Tests Anasazi_Epetra_ModalSolversTester_MPI_4 and Anasazi_Epetra_OrthoManagerGenTester_0_MPI_4 failing in Trilinos-atdm-white-ride-cuda-debug and other builds on 3/18/2018 Tests Anasazi_Epetra_ModalSolversTester_MPI_4 and Anasazi_Epetra_OrthoManagerGenTester_[0,1]_MPI_4 failing in 'debug' builds on white/ride Apr 3, 2018
@bartlettroscoe
Member Author

Just realized that the @trilinos/framework team ran into these same randomly failing tests in #1393 and they resolved the issue by disabling those tests as well. So it looks like this is the right decision to disable these tests in the ATDM builds.

But it also suggests that perhaps the problems with these tests should be studied more carefully or these tests just need to be disabled altogether. That way, other people and projects will not run into these randomly failing tests over and over again. And if these are the only real tests for "ModalSolvers" in Anasazi, then perhaps that feature is not ready to be used by people and should be disabled by default as experimental code or something? Then we could set up some build of Trilinos for all of this "Experimental" code so at least we know how it is doing.

bartlettroscoe added a commit that referenced this issue Apr 3, 2018
…ing-anasazi-tests

Disable 3 Anasazi tests that randomly fail in debug builds on white/ride (#2473)
@bartlettroscoe
Member Author

PR #2501 was merged just now, merging the commit 2e9da0c. Therefore, we should see these three tests disabled for these builds on white/ride tomorrow.

Putting this issue in review

@bartlettroscoe bartlettroscoe added the stage: in review Primary work is completed and now is just waiting for human review and/or test feedback label Apr 3, 2018
searhein pushed a commit to searhein/Trilinos that referenced this issue Apr 4, 2018
…evelop

* 'develop' of https://github.com/trilinos/Trilinos: (560 commits)
  Disabling Stefan Boltzmann tests 1 and 2 due to an unresolved hang. Also, resetting the default problem size for Helmholtz to 16x16.
  Disable 3 Anasazi tests that randomly fail in debug builds on white/ride (trilinos#2473)
  TrilinosCouplings: Output iteration count
  Tpetra: use KokkosKernels addition (trilinos#2444)
  Tank solve and value correspond for all parameters
  TrilinosCouplings: OK.  Now compiling
  TrilinosCouplings: More RTC updates
  Disabling failing test.
  Stokhos:  Allow mean-based PCE preconditioner with double scalar type.
  (Painstakingly) reimplemented every tank equation individually. Now have solve and value working correctly together.
  TrilinosCouplings: Turning off file default output
  Kokkos: fix compilation for GCC 4.8.4
  TrilinosCouplings: Adding block / RTC materials support to Tpetra example (take2)
  Kokkos: disable failing CUDA+DEBUG test
  TrilinosCouplings: Adding block / RTC materials support to Tpetra example
  adding doxygen for nd method
  Added comment
  Fixed warnings.
  Panzer: fix race condition in unit test exodus writer for CurlLaplacian example
  Fixed some problems in tank example. Solve and value are at least consistent when theta=1
  ...
@hkthorn
Contributor

hkthorn commented Apr 4, 2018

@bartlettroscoe @mhoemmen @srajama1 I have found the underlying issue in these tests. They use a Teuchos::SerialDenseMatrix, which is a serial object without MPI communication or implied synchronization of values. These matrices are randomized on each processor and then used to perform tests of the orthogonalization routines and modal solvers. Again, there is no explicit synchronization of Teuchos SDM objects, so when the randomization generates different matrices on different processors, the tests fail because the explicit expectations of the classes being tested, orthogonalization and modal solvers, are violated. I have a feeling this pattern might be in Belos as well. I will fix this today.
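
The pattern described above can be illustrated without MPI or Teuchos. The following is only a sketch under stated assumptions: the "ranks" are simulated as entries of a `std::vector`, the per-rank seeds are hypothetical stand-ins for independent generator state on each process, and `broadcastFromRoot` is a plain copy standing in for a collective broadcast (in real MPI code, something like `MPI_Bcast` from a root rank). It is not the code from the actual fix.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

using Matrix = std::vector<double>;  // flattened n-by-n dense matrix

// Stand-in for a purely local random fill. Each simulated rank has its own
// generator state (hypothetical seeds), so nothing forces the values to match.
Matrix randomLocal(std::size_t n, unsigned rank) {
  std::mt19937 gen(12345u + rank);
  std::uniform_real_distribution<double> dist(-1.0, 1.0);
  Matrix m(n * n);
  for (double& x : m) x = dist(gen);
  return m;
}

// Every simulated rank randomizes its own copy, with no synchronization.
std::vector<Matrix> randomOnEveryRank(std::size_t n, unsigned numRanks) {
  std::vector<Matrix> perRank;
  for (unsigned r = 0; r < numRanks; ++r) perRank.push_back(randomLocal(n, r));
  return perRank;
}

// Stand-in for a broadcast: all ranks overwrite their copy with the root's.
void broadcastFromRoot(std::vector<Matrix>& perRank, std::size_t root) {
  assert(root < perRank.size());
  const Matrix rootCopy = perRank[root];
  for (Matrix& m : perRank) m = rootCopy;
}

// True when all simulated ranks hold identical matrices.
bool allRanksAgree(const std::vector<Matrix>& perRank) {
  for (const Matrix& m : perRank)
    if (m != perRank.front()) return false;
  return true;
}
```

Calling `randomOnEveryRank(3, 4)` yields four different matrices, so `allRanksAgree` returns false. After `broadcastFromRoot(q, 0)` every rank holds rank 0's values, which is the precondition the orthogonalization and modal solver tests rely on.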

@mhoemmen
Contributor

mhoemmen commented Apr 4, 2018

@hkthorn Wow! Thanks for finding this; sounds tricky!

@bartlettroscoe
Member Author

@hkthorn, so this is a defect in the tests not the library code that users depend on?

Let me know when you have merged the fix into the Trilinos 'develop' branch and then I will re-enable these tests and we will let them run in the ATDM builds of Trilinos.

@hkthorn
Contributor

hkthorn commented Apr 4, 2018

@bartlettroscoe @mhoemmen Absolutely, this is a defect in the design of the test. I will let you know when the fix is in Trilinos 'develop' branch so we can re-enable the tests for ATDM builds.

hkthorn added a commit that referenced this issue Apr 5, 2018
The longstanding test failures for the ModalSolvers and OrthoManager have been
tracked down to the randomization of Teuchos::SerialDenseMatrix objects in parallel.
There is no expectation that calling random() on an object that is locally owned
to one MPI process will result in a SerialDenseMatrix that has the SAME random numbers
in it on every MPI processor.  It's that easy.

#2473
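
To see why inconsistent random matrices produce an O(1) "orthonorm error of VQ" rather than a small perturbation, consider a multivector V distributed by rows and a small matrix Q that is supposed to be replicated on every process. The sketch below is an illustration under assumptions, not the test code: ranks are simulated by row blocks, V has two columns, and Q is a 2-by-2 rotation (orthogonal for any angle). All names are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Row = std::array<double, 2>;   // one row of an n-by-2 multivector
using Mat2 = std::array<Row, 2>;     // small 2-by-2 matrix, one copy per rank

// 2-by-2 rotation matrix; orthogonal for every angle theta.
Mat2 rotation(double theta) {
  const double c = std::cos(theta), s = std::sin(theta);
  Mat2 q;
  q[0] = {c, -s};
  q[1] = {s, c};
  return q;
}

// Each simulated rank owns rowsPerRank consecutive rows of V and multiplies
// them by ITS OWN copy of Q, as a row-distributed V*Q would do.
std::vector<Row> applyPerRank(const std::vector<Row>& V, std::size_t rowsPerRank,
                              const std::vector<Mat2>& qOnRank) {
  assert(rowsPerRank > 0 && V.size() == rowsPerRank * qOnRank.size());
  std::vector<Row> out(V.size());
  for (std::size_t i = 0; i < V.size(); ++i) {
    const Mat2& Q = qOnRank[i / rowsPerRank];
    out[i][0] = V[i][0] * Q[0][0] + V[i][1] * Q[1][0];
    out[i][1] = V[i][0] * Q[0][1] + V[i][1] * Q[1][1];
  }
  return out;
}

// Frobenius norm of W^T W - I for an n-by-2 W.
double gramError(const std::vector<Row>& W) {
  double g00 = 0.0, g01 = 0.0, g11 = 0.0;
  for (const Row& r : W) {
    g00 += r[0] * r[0];
    g01 += r[0] * r[1];
    g11 += r[1] * r[1];
  }
  return std::sqrt((g00 - 1.0) * (g00 - 1.0) + 2.0 * g01 * g01 +
                   (g11 - 1.0) * (g11 - 1.0));
}
```

If both simulated ranks use the same rotation, `gramError` of the product stays at rounding level. If rank 0 uses a rotation by 0 and rank 1 a rotation by pi/2 on a V whose first unit column lives on rank 0 and second on rank 1, the product's columns are no longer orthonormal and the error is of order one, even though each local Q is perfectly orthogonal.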
@bartlettroscoe
Member Author

It looks like the test Anasazi_Epetra_LOBPCG_solvertest_MPI_4 may also have some random failures. We saw the following failure for this test in the build Trilinos-atdm-white-ride-gnu-debug-openmp on 'white' on 4/18/2018:

which showed:

Anasazi in Trilinos 12.13 (Dev)

Testing solver(default,default) with standard eigenproblem...
Testing solver(default,default) with generalized eigenproblem...
Testing solver(nev,false) with standard eigenproblem...
Testing solver(nev,true) with standard eigenproblem...
Testing solver(nev,false) with generalized eigenproblem...
Testing solver(nev,true) with generalized eigenproblem...
Testing solver(2*nev,false) with standard eigenproblem...
Testing solver(2*nev,true) with standard eigenproblem...
Testing solver(2*nev,false) with generalized eigenproblem...
Testing solver(2*nev,true) with generalized eigenproblem...
[white25:127665] *** Process received signal ***
[white25:127665] Signal: Segmentation fault (11)
[white25:127665] Signal code: Address not mapped (1)
[white25:127665] Failing at address: 0x10024850020
[white25:127665] [ 0] [0x100000050478]
[white25:127665] [ 1] [0x3ff0000000000000]
[white25:127665] *** End of error message ***
--------------------------------------------------------------------------
mpiexec noticed that process rank 3 with PID 127665 on node white25 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

Looking at the query:

it looks like this test also failed on 'ride' in the same build on 4/3/2018 with the output:


Anasazi in Trilinos 12.13 (Dev)

Testing solver(default,default) with standard eigenproblem...
Testing solver(default,default) with generalized eigenproblem...
Testing solver(nev,false) with standard eigenproblem...
Testing solver(nev,true) with standard eigenproblem...
Testing solver(nev,false) with generalized eigenproblem...
Testing solver(nev,true) with generalized eigenproblem...
Testing solver(2*nev,false) with standard eigenproblem...
Testing solver(2*nev,true) with standard eigenproblem...
[ride13:114533] *** Process received signal ***
[ride13:114533] Signal: Segmentation fault (11)
[ride13:114533] Signal code: Address not mapped (1)
[ride13:114533] Failing at address: 0x10036020010
[ride13:114533] [ 0] [0x100000050478]
[ride13:114533] [ 1] [0x3ff0000000000000]
[ride13:114533] *** End of error message ***
--------------------------------------------------------------------------
mpiexec noticed that process rank 1 with PID 114533 on node ride13 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

We can keep an eye out to see if this test fails again in this build or some other build. But if it does, we should likely disable this test for now.

@hkthorn
Contributor

hkthorn commented Apr 9, 2018

I'll give the test a look to see if there are any bad patterns there. I have merged the PR that fixes the testing for the OrthoManager and ModalSolvers:

#2517

Thanks!

bartlettroscoe added a commit to bartlettroscoe/Trilinos that referenced this issue Apr 23, 2018
…able-random-failing-anasazi-tests" (trilinos#2473)

This reverts commit 2e9da0c, reversing
changes made to c828f5a.

The merge branch in PR trilinos#2517 should allow these tests to pass now.
bartlettroscoe added a commit that referenced this issue Apr 24, 2018
…ests

Revert "Merge pull request #2501 from bartlettroscoe/2473-disable-random-failing-anasazi-tests" (#2473)

This will allow these tests to run again in these ATDM builds and then we can see if they pass or not.
@bartlettroscoe
Member Author

The PR #2621 was merged that re-enables these tests. Now we wait and see how they run and if they fail or not in the coming days and weeks. I am removing the "Disabled Tests" label.

@bartlettroscoe
Member Author

NOTE: The test Anasazi_Epetra_LOBPCG_solvertest_MPI_4 that was randomly failing as described above is still randomly failing with a segfault, as recently as 2018-04-23. Therefore, since PR #2517 did not fix this test, we can assume it is unrelated to the other Anasazi tests covered in this issue. I created the new issue #2633 to address the issues with that test.

Therefore, all that is left for this current issue is to watch and see if we see any more random failures with the tests Anasazi_Epetra_ModalSolversTester_MPI_4 and Anasazi_Epetra_OrthoManagerGenTester_[0,1]_MPI_4 ...

@bartlettroscoe
Member Author

Looking at the recent history for these tests on CDash after 5/19/2018 (when the NETLIB BLAS and LAPACK got put back as described in #2454 (comment)) in the following queries:

We can see that these tests ran in the Trilinos-atdm-white-ride-gnu-debug-openmp and Trilinos-atdm-white-ride-cuda-debug builds and did not fail a single time.

Therefore, this issue appears to be resolved.

Closing as complete.

hkthorn added a commit to hkthorn/Trilinos that referenced this issue Oct 19, 2018
While addressing issue trilinos#2473, I found other places where a random serial dense
matrix was used and expected to be the same in parallel.  The synchronization
method that was used to address the issue in Anasazi has been moved to the
Teuchos serial dense helpers file so that other packages can use this utility
in the generation of tests.  In particular, this utility needed to be integrated
into the MVOP testers for Belos and Anasazi, as well as the Belos orthogonalization
tester.

The assumption that a call to generate a random variable will return the same
value on all processors is false and could have unknown consequences for testing.

While it is unknown if any random failures can be tracked to these changes at this
time, previous issues with Anasazi have been caused by this bad assumption.  So, it
is better to fix it.
@bartlettroscoe bartlettroscoe added the PA: Linear Solvers Issues that fall under the Trilinos Linear Solvers Product Area label Nov 30, 2018
tjfulle pushed a commit to tjfulle/Trilinos that referenced this issue Dec 6, 2018