Belos_pseudo_stochastic_pcg_hb_[0,1]_MPI_4 tests failing due to max iterations limit, seemingly randomly, in the Trilinos-atdm-white-ride-cuda-debug build on 'white'
#2920
Comments
Given that the ATDM APP codes are unlikely to be using a (symmetric) pseudo-block stochastic CG solver, and this is breaking already cleaned-up and promoted builds on 'white' and 'ride', I think it would be reasonable to disable these tests for at least the CUDA builds on 'white' and 'ride'. (We can't have randomly failing tests in promoted builds, since there will soon be automated processes based on these builds.) But if this is an important solver for some important customers, then someone should take the time to debug what is happening here, in case this is a defect in the solver code and not just a defect in the test (like the one causing the random test failures in #2473). If we do disable these tests for the CUDA builds on 'white' and 'ride', we will mark this issue with "Disabled Tests" and "Stalled" and leave it open as a reminder to address the problem. |
@bartlettroscoe The class documentation says "THIS CODE IS CURRENTLY EXPERIMENTAL. CAVEAT EMPTOR." I'd say, please feel free to disable the tests. |
@hkthorn: Can you comment on disabling (and fixing later) these tests on CUDA? |
…PI_4 for cuda-debug build on white/ride (trilinos#2920) This test randomly fails in this build. See trilinos#2920. This is also experimental code, so hopefully ATDM customers are not using this solver.
These tests were disabled for the build, which shows:
and those tests are missing from: (NOTE: These tests are not shown as "missing", since the 'bsub' command crashed the day before and therefore CDash does not note this.) But these tests are still enabled and run for other builds, as shown, for example, in: which shows: |
As shown in this updated query, the test
See the max iterations reached:
The previous day, that same test's output in that same build, shown here, showed:
See the number of iterations:
So this test is randomly failing in this build. Therefore, we need to disable these tests in this build. |
I looked at the code -- it actually uses a random number, but on host, not on device. |
@mhoemmen, is it not given the same seed so that each run of the test is the same? Or, is the behavior of the code in this test truly random (which makes it a problematic test in general)? |
As shown in this updated query, the tests |
…PI_4 in gnu-debug-openmp build on white/ride (trilinos#2920) This failed by maxing out at 100 iterations today in the Trilinos-atdm-white-ride-gnu-debug-openmp build, where it otherwise converges in 87 iterations. Therefore, we are disabling this test like we did for the cuda-debug build. For more details see trilinos#2920.
…PI_4 in cuda-opt build on white/ride (trilinos#2920) This failed by maxing out at 100 iterations on 6/17/2018 in the Trilinos-atdm-white-ride-cuda-opt build, where it otherwise converges in 87 iterations. Therefore, we are disabling this test like we did for some other builds on white/ride. For more details see trilinos#2920.
@bartlettroscoe It uses a host-side random number generator. We should actually fix the solver by using one of the C++11 random number generators that promise cross-platform identical behavior given the same seed. I'm pretty sure this would be OK even for Windows Epetra-only MueLu builds, since the main issue with Visual Studio is its incomplete implementation of SFINAE. |
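(To illustrate the suggestion above, here is a minimal standalone sketch, not Belos' actual code. The C++11 random *engines* are required by the standard to produce identical sequences on every platform for a given seed, while the standard *distributions* are not, so a portable fix would seed an engine explicitly and avoid relying on distribution internals.)

```cpp
// Minimal sketch (not Belos code): reproducible random numbers via C++11.
#include <iostream>
#include <random>

int main () {
  // A fixed seed makes every run, on every platform, produce the same
  // sequence: the standard pins down std::mt19937's output exactly
  // (e.g. the 10000th value from a default-seeded mt19937(5489)
  // must be 4123659995).
  std::mt19937 engine (42);

  for (int i = 0; i < 4; ++i) {
    std::cout << engine () << "\n";  // identical on all conforming platforms
  }

  // Caveat: std::uniform_real_distribution et al. may map engine output
  // differently across standard libraries, so scale by hand if bitwise
  // cross-platform reproducibility matters:
  const double u = engine () / (static_cast<double> (std::mt19937::max ()) + 1.0);
  std::cout << u << "\n";  // value in [0, 1)
  return 0;
}
```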
@bartlettroscoe @mhoemmen Given that the single-vector CG solvers are just barely failing to converge on these test platforms (#2920 and #3007), I am increasing the maximum number of iterations a little. There is nothing wrong with the solvers. This is not just due to the stochastic nature of this solver; the preconditioned CG solver has the same issue (#3007). That is because randomization is still used to generate the right-hand side for the linear system being solved. I'm going ahead with the modification to the CMakeLists.txt so we can close this issue and get the tests back up and running on white/ride. |
@hkthorn Thanks for taking care of this! I am curious: does a random RHS have any benefit over fixing the answer x (say, to all ones) and then computing A*x to arrive at the RHS? That way we would work around this issue altogether. |
@srajama1 Fixing the answer to a constant value could be a bad choice for test matrices; it depends on the eigenvalues/eigenvectors. I don't want to do that. I have toyed with the idea of just creating an RHS vector for each matrix that we use, so we know the expected behavior of the solver and are not subject to platform-dependent randomization. |
…os#2920 and trilinos#3007 The random right-hand side vector used in these tests can sometimes require a few more iterations of preconditioned CG than normal. Normal is about 87, and the maximum number of iterations is set at 100; some random right-hand sides require 104 iterations. It is not an issue with the solver, but with the linear system being solved. Thus, this commit increases the maximum iterations to 110, to give the preconditioned CG solver more iterations when the right-hand side vector is not as easy to solve for.
@hkthorn: I thought this was for one matrix. If it covers multiple matrices, then yes, I meant one answer for each. We can commit the RHS vector in another file, if needed. |
That would eliminate the differing behavior across platforms. Having tests that behave differently on different platforms, due to different behavior of the random number generator, is generally not desirable. |
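(A sketch of what the fixed-answer alternative could look like, assuming a Tpetra-based test harness; the helper name makeDeterministicRhs is hypothetical, and this shows only the mechanics, not a proposed patch. As noted above, an all-ones answer may interact badly with a matrix's eigenstructure, so a stored per-matrix RHS may be the better option.)

```cpp
// Hypothetical sketch: fix the answer x and compute b = A*x, so no RNG
// is involved and the iteration count is identical on every platform.
#include <Teuchos_RCP.hpp>
#include <Tpetra_MultiVector.hpp>
#include <Tpetra_Operator.hpp>

using Scalar = double;
using MV = Tpetra::MultiVector<Scalar>;
using OP = Tpetra::Operator<Scalar>;

Teuchos::RCP<MV> makeDeterministicRhs (const Teuchos::RCP<const OP>& A)
{
  auto x = Teuchos::rcp (new MV (A->getDomainMap (), 1));
  auto b = Teuchos::rcp (new MV (A->getRangeMap (), 1));
  x->putScalar (1.0);   // known answer: x = [1, 1, ..., 1]^T
  A->apply (*x, *b);    // b = A*x (defaults: NO_TRANS, alpha = 1, beta = 0)
  return b;
}
```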
On July 2, 2018, the CMakeLists.txt was modified to increase the maximum number of iterations for the preconditioned CG tests that were failing. This modification was made for both standard PCG and stochastic PCG. With this change, these tests appear to pass consistently on white/ride. Thus, I am removing those tests from the environment scripts that disable them.
Closing this issue. The tests have been re-enabled. |
Seriously, I'm closing this issue. |
Reopening until we have confirmation on CDash. I have removed the … NOTE: If these were not randomly failing tests and they had failed every day before, then we could look at the CDash results tomorrow, comment, and close this issue then if the tests passed. But since these were random failures, we have to wait much, much longer to get confirmation. Make sense? |
Reopening like I said I would :-) |
Alrighty, I expect that you will close this bug then. |
@hkthorn, for now, yes. But in the future, someone responsible for the Trilinos Linear Solvers Product Area will do it as part of a better structured process. |
…GDSW-VFEM-Coarse-Spaces * 'develop' of https://github.com/trilinos/Trilinos: (186 commits)
Tpetra: initial commit of user's guide (trilinos#3553)
MueLu: fix master xml list problem
PyTrilinos: Remove includes of NOX_Version.h
Xpetra: update threshold type
Re-enable Belos PCG tests (trilinos#2920 & trilinos#3007) on white/ride
Phalanx: add debug output support during DAG traversal
Switch to new devpack/20180521/openmpi/2.1.2/gcc/7.2.0/cuda/9.2.88 env (trilinos#3290)
MueLu: Fix gold files
MueLu gold files: Fix bad if statement in rebase scripts
MueLu: Xpetra: add threshold for zero diag fix
kokkos-kernels: Patch to fix trilinos#3493
Tpetra::MultiVector: Improve unit tests (help w/ trilinos#3493)
running update_params.sh to fix up latex
Disable four exodus SEACAS tests failing with Not Run on mutrino (trilinos#3496) (trilinos#3530)
PyTrilinos: Fix case-sensitive include of Ifpack_ConfigDef.h
Stratimikos Belos adapter: On Belos error, throw Thyra::CatastrophicSolveFailure
Stratimikos Belos adapter: When "Timer Label" is set, use it in output
Amesos: SuperLU_DIST version fixes
put in a new phase 3 aggregation option that avoids singletons at all costs (even grouping vertices with non-neighbors into aggregates)
Framework: Parameterized CTest build update to enable extra packages to be set by Jenkins parameters. (trilinos#3520)
...
As per the Next Action Status, these tests have not failed, and it is 12/3/2018, so I am marking this issue as closed. Here are links to the tests' histories:
CC: @trilinos/belos, @fryeguy52, @srajama1 (Linear Solvers Product Lead)
Next Action Status
Disabled in the Trilinos-atdm-white-ride-cuda-debug build in commit cc7fff2, pushed on 6/12/2018; the tests showed as disabled and missing on CDash on 6/13/2018. PR #3546, merged on 10/2/2018, re-enables the tests, which should be fixed by PR #3050, merged earlier. No new failures as of 12/3/2018!
Description
As shown in this rather complex query showing all failing Belos tests other than Belos_rcg_hb_MPI_4 in all promoted ATDM builds since 5/10/2018, the tests Belos_pseudo_stochastic_pcg_hb_0_MPI_4 and Belos_pseudo_stochastic_pcg_hb_1_MPI_4 failed 5 times in total and appear to be randomly failing in the Trilinos-atdm-white-ride-cuda-debug build. (The other failing test shown was Belos_pseudo_pcg_hb_1_MPI_4, also for the Trilinos-atdm-white-ride-cuda-debug build, but that one only failed once, yesterday, so we will ignore it for now.) (The test Belos_rcg_hb_MPI_4 was excluded from the above query because it is addressed in #2919.)

Looking at the testing history for the tests Belos_pseudo_stochastic_pcg_hb_[0,1]_MPI_4 from 5/10/2018 through today, 6/8/2018, in this less complex query, one can see that these tests complete in about the same time (under 2 seconds) whether they pass or fail.

The output when these tests fail (such as shown for the test Belos_pseudo_stochastic_pcg_hb_1_MPI_4 yesterday, 6/7/2018, here) shows that the test fails due to the max iteration limit of 100 being reached before the desired residual tolerance is achieved. The other failures for the tests Belos_pseudo_stochastic_pcg_hb_0_MPI_4 and Belos_pseudo_stochastic_pcg_hb_1_MPI_4 all look to be maxing out the number of iterations at 100.

When the test Belos_pseudo_stochastic_pcg_hb_1_MPI_4 passed the day before, on 6/6/2018, the output shown here indicates it converged in 87 iterations. I looked at several other instances when these tests passed, and they all look to be converging in 87 iterations.
Is this non-deterministic behavior due to the fact that this is "stochastic" code, and therefore the behavior is truly random? Is it because the random seed is not set consistently? Or is it due to non-deterministic behavior in the accumulations with the CUDA 8.0 threaded Kokkos implementation on this machine? The fact that the tests converge in 87 iterations whenever they pass suggests that this is not purposeful random behavior, but the result of some other undesired and unintended non-determinism.
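(On the third possibility: floating-point addition is not associative, so a threaded or CUDA reduction that combines dot-product partial sums in a varying order can yield slightly different residual norms, which can be enough to push a solve that barely converges at 87 iterations past a cap of 100. A tiny standalone illustration, unrelated to any Trilinos code:)

```cpp
// Demo: the same three numbers summed in two orders give two results,
// because floating-point addition is not associative.
#include <cstdio>

int main () {
  const double big = 1.0e16, small = 1.0;
  // Left-to-right order: each 'small' is absorbed by 'big' and lost.
  const double sum1 = (big + small) + small;  // == 1.0e16
  // A different reduction order keeps the small terms:
  const double sum2 = big + (small + small);  // == 1.0000000000000002e16
  std::printf ("%.17g\n%.17g\n", sum1, sum2);
  return 0;
}
```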
Steps to reproduce
Following the instructions at:
one might be able to reproduce this behavior on 'white' or 'ride' by cloning the Trilinos GitHub repo, checking out the 'develop' branch, and then doing:
But given that this test looks to be randomly failing, it may be hard to reproduce this behavior locally.