Belos_pseudo_stochastic_pcg_hb_[0,1]_MPI_4 tests failing due to max iterations limit, seemingly randomly, in the Trilinos-atdm-white-ride-cuda-debug build on 'white'
#2920
Comments
Given that the ATDM APP codes are unlikely to be using a (symmetric) pseudo-block stochastic CG solver, and this is breaking already cleaned-up and promoted builds on 'white' and 'ride', I think it would be reasonable to disable these tests for at least the CUDA builds on 'white' and 'ride'. (We can't have randomly failing tests in promoted builds, since there will soon be automated processes based on these builds.) But if this is an important solver for some important customers, then someone should take the time to debug what is happening here, in case this is a defect in the solver code and not just a defect in the test (like the one causing the random test failures in #2473). If we do disable these tests for the CUDA builds on 'white' and 'ride', we will mark this issue with "Disabled Tests" and "Stalled" and leave it open as a reminder to address the problem. |
@bartlettroscoe The class documentation says "THIS CODE IS CURRENTLY EXPERIMENTAL. CAVEAT EMPTOR." I'd say, please feel free to disable the tests. |
@hkthorn: Can you comment on disabling (and fixing later) these tests on CUDA? |
…PI_4 for cuda-debug build on white/ride (trilinos#2920) This test randomly fails in this build. See trilinos#2920. This is also experimental code, so hopefully ATDM customers are not using this solver.
These tests were disabled for the build, which shows:
and those tests are missing from: (NOTE: These tests are not shown as "missing", since the 'bsub' command crashed the day before and therefore CDash does not note this.) But these tests are still enabled and run for other builds, as shown, for example, in: which shows: |
As shown in this updated query, the test
See the max iterations reached:
The previous day, that same test's output in that same build, shown here, showed:
See the number of iterations:
So this test is randomly failing in this build. Therefore, we need to disable these tests in this build. |
I looked at the code -- it actually uses a random number, but on host, not on device. |
@mhoemmen, is it not given the same seed so that each run of the test is the same? Or, is the behavior of the code in this test truly random (which makes it a problematic test in general)? |
As shown in this updated query, the tests |
…PI_4 in gnu-debug-openmp build on white/ride (trilinos#2920) This failed by maxing out at 100 iterations today in the Trilinos-atdm-white-ride-gnu-debug-openmp build, where it otherwise converges in 87 iterations. Therefore, we are disabling this test like we did for the cuda-debug build. For more details see trilinos#2920.
…PI_4 in cuda-opt build on white/ride (trilinos#2920) This failed by maxing out at 100 iterations on 6/17/2018 in the Trilinos-atdm-white-ride-cuda-opt build, where it otherwise converges in 87 iterations. Therefore, we are disabling this test like we did for some other builds on white/ride. For more details see trilinos#2920.
@bartlettroscoe It uses a host-side random number generator. We should actually fix the solver by using one of the C++11 random number generators that promise cross-platform identical behavior given the same seed. I'm pretty sure this would be OK even for Windows Epetra-only MueLu builds, since the main issue with Visual Studio is its incomplete implementation of SFINAE. |
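(To illustrate the suggestion above, here is a minimal standalone sketch, not Belos' actual code. The C++11 random *engines* are required by the standard to produce identical sequences on every platform for a given seed, while the standard *distributions* are not, so a portable fix would seed an engine explicitly and avoid relying on distribution internals.)

```cpp
// Minimal sketch (not Belos code): reproducible random numbers via C++11.
#include <iostream>
#include <random>

int main () {
  // A fixed seed makes every run, on every platform, produce the same
  // sequence: the standard pins down std::mt19937's output exactly
  // (e.g. the 10000th value from a default-seeded mt19937(5489)
  // must be 4123659995).
  std::mt19937 engine (42);

  for (int i = 0; i < 4; ++i) {
    std::cout << engine () << "\n";  // identical on all conforming platforms
  }

  // Caveat: std::uniform_real_distribution et al. may map engine output
  // differently across standard libraries, so scale by hand if bitwise
  // cross-platform reproducibility matters:
  const double u = engine () / (static_cast<double> (std::mt19937::max ()) + 1.0);
  std::cout << u << "\n";  // value in [0, 1)
  return 0;
}
```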
@bartlettroscoe @mhoemmen Given that the single-vector CG solvers are just barely failing to converge on these test platforms (#2920 and #3007), I am increasing the maximum number of iterations a little. There is nothing wrong with the solvers. This is not just due to the stochastic nature of this solver; the preconditioned CG solver has the same issue (#3007). That is because randomization is still used to generate the right-hand side for the linear system being solved. I'm going ahead with the modification to the CMakeLists.txt so we can close this issue and get the tests back up and running on white/ride. |
@hkthorn Thanks for taking care of this! I am curious: does a random RHS have any benefit over fixing the answer x (say, to all ones) and then computing A*x to arrive at the RHS? That way we would work around this issue altogether. |
@srajama1 Fixing the answer to a constant value could be a bad choice for test matrices; it depends on the eigenvalues/eigenvectors. I don't want to do that. I have toyed with the idea of just creating an RHS vector for each matrix that we use, so we know the expected behavior of the solver and are not subject to platform-dependent randomization. |
…os#2920 and trilinos#3007 The random right-hand side vector used in these tests can sometimes require a few more iterations of preconditioned CG than normal. Normal is about 87, and the maximum number of iterations is set at 100; some random right-hand sides require 104 iterations. It is not an issue with the solver, but with the linear system being solved. Thus, this commit increases the maximum iterations to 110, to give the preconditioned CG solver more iterations when the right-hand side vector is not as easy to solve for.
@hkthorn: I thought this was for one matrix. If it covers multiple matrices, then yes, I meant one answer for each. We can commit the RHS vector in another file, if needed. |
That would eliminate the differing behavior across platforms. Having tests that behave differently on different platforms, due to different behavior of the random number generator, is generally not desirable. |
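(A sketch of what the fixed-answer alternative could look like, assuming a Tpetra-based test harness; the helper name makeDeterministicRhs is hypothetical, and this shows only the mechanics, not a proposed patch. As noted above, an all-ones answer may interact badly with a matrix's eigenstructure, so a stored per-matrix RHS may be the better option.)

```cpp
// Hypothetical sketch: fix the answer x and compute b = A*x, so no RNG
// is involved and the iteration count is identical on every platform.
#include <Teuchos_RCP.hpp>
#include <Tpetra_MultiVector.hpp>
#include <Tpetra_Operator.hpp>

using Scalar = double;
using MV = Tpetra::MultiVector<Scalar>;
using OP = Tpetra::Operator<Scalar>;

Teuchos::RCP<MV> makeDeterministicRhs (const Teuchos::RCP<const OP>& A)
{
  auto x = Teuchos::rcp (new MV (A->getDomainMap (), 1));
  auto b = Teuchos::rcp (new MV (A->getRangeMap (), 1));
  x->putScalar (1.0);   // known answer: x = [1, 1, ..., 1]^T
  A->apply (*x, *b);    // b = A*x (defaults: NO_TRANS, alpha = 1, beta = 0)
  return b;
}
```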
On July 2, 2018, the CMakeLists.txt was modified to increase the maximum number of iterations for the preconditioned CG tests that were failing. This modification was made for both standard PCG and stochastic PCG. With this change, these tests appear to pass consistently on white/ride. Thus, I am removing those tests from the environment scripts that disable them.
Closing this issue. The tests have been re-enabled. |
Seriously, I'm closing this issue. |
Reopening until we have confirmation on CDash. I have removed the … NOTE: If these were not randomly failing tests and they had failed every day before, then we could look at the CDash results tomorrow, comment, and close this issue then if the tests passed. But since these were random failures, we have to wait much, much longer to get confirmation. Make sense? |
Reopening like I said I would :-) |
Alrighty, I expect that you will close this bug then. |
@hkthorn, for now, yes. But in the future, someone responsible for the Trilinos Linear Solvers Product Area will do it as part of a better structured process. |
…GDSW-VFEM-Coarse-Spaces * 'develop' of https://github.com/trilinos/Trilinos: (186 commits)
Tpetra: initial commit of user's guide (trilinos#3553)
MueLu: fix master xml list problem
PyTrilinos: Remove includes of NOX_Version.h
Xpetra: update threshold type
Re-enable Belos PCG tests (trilinos#2920 & trilinos#3007) on white/ride
Phalanx: add debug output support during DAG traversal
Switch to new devpack/20180521/openmpi/2.1.2/gcc/7.2.0/cuda/9.2.88 env (trilinos#3290)
MueLu: Fix gold files
MueLu gold files: Fix bad if statement in rebase scripts
MueLu: Xpetra: add threshold for zero diag fix
kokkos-kernels: Patch to fix trilinos#3493
Tpetra::MultiVector: Improve unit tests (help w/ trilinos#3493)
running update_params.sh to fix up latex
Disable four exodus SEACAS tests failing with Not Run on mutrino (trilinos#3496) (trilinos#3530)
PyTrilinos: Fix case-sensitive include of Ifpack_ConfigDef.h
Stratimikos Belos adapter: On Belos error, throw Thyra::CatastrophicSolveFailure
Stratimikos Belos adapter: When "Timer Label" is set, use it in output
Amesos: SuperLU_DIST version fixes
put in a new phase 3 aggregation option that avoids singletons at all costs (even grouping vertices with non-neighbors into aggregates)
Framework: Parameterized CTest build update to enable extra packages to be set by Jenkins parameters. (trilinos#3520)
...
As per the Next Action Status, these tests have not failed, and it is 12/3/2018, so I am marking this issue as closed. Here are links to the tests' histories:
CC: @trilinos/belos, @fryeguy52, @srajama1 (Linear Solvers Product Lead)
Next Action Status
Disabled in the Trilinos-atdm-white-ride-cuda-debug build in commit cc7fff2, pushed on 6/12/2018; the tests showed as disabled and missing on CDash on 6/13/2018. PR #3546, merged on 10/2/2018, re-enables the tests, which should be fixed by PR #3050, merged earlier. No new failures as of 12/3/2018!
Description
As shown in this rather complex query showing all failing Belos tests other than Belos_rcg_hb_MPI_4 in all promoted ATDM builds since 5/10/2018, the tests Belos_pseudo_stochastic_pcg_hb_0_MPI_4 and Belos_pseudo_stochastic_pcg_hb_1_MPI_4 failed 5 times in total and appear to be randomly failing in the Trilinos-atdm-white-ride-cuda-debug build. (The other failing test shown was Belos_pseudo_pcg_hb_1_MPI_4, also for the Trilinos-atdm-white-ride-cuda-debug build, but that one only failed once, yesterday, so we will ignore it for now.) (The test Belos_rcg_hb_MPI_4 was excluded from the above query because it is addressed in #2919.)

Looking at the testing history for the tests Belos_pseudo_stochastic_pcg_hb_[0,1]_MPI_4 from 5/10/2018 through today, 6/8/2018, in this less complex query, one can see that these tests complete in about the same time (under 2 seconds) whether they pass or fail.

The output when these tests fail (such as shown for the test Belos_pseudo_stochastic_pcg_hb_1_MPI_4 yesterday, 6/7/2018, here) shows that the test fails due to the max iteration limit of 100 being reached before the desired residual tolerance is achieved. The other failures for the tests Belos_pseudo_stochastic_pcg_hb_0_MPI_4 and Belos_pseudo_stochastic_pcg_hb_1_MPI_4 all look to be maxing out the number of iterations at 100.

When the test Belos_pseudo_stochastic_pcg_hb_1_MPI_4 passed the day before, on 6/6/2018, the output shown here indicates it converged in 87 iterations. I looked at several other instances when these tests passed, and they all look to be converging in 87 iterations.
Is this non-deterministic behavior due to the fact that this is "stochastic" code, and therefore the behavior is truly random? Is it because the random seed is not set consistently? Or is it due to non-deterministic behavior in the accumulations with the CUDA 8.0 threaded Kokkos implementation on this machine? The fact that the tests converge in 87 iterations whenever they pass suggests that this is not purposeful random behavior, but the result of some other undesired and unintended non-determinism.
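(On the third possibility: floating-point addition is not associative, so a threaded or CUDA reduction that combines dot-product partial sums in a varying order can yield slightly different residual norms, which can be enough to push a solve that barely converges at 87 iterations past a cap of 100. A tiny standalone illustration, unrelated to any Trilinos code:)

```cpp
// Demo: the same three numbers summed in two orders give two results,
// because floating-point addition is not associative.
#include <cstdio>

int main () {
  const double big = 1.0e16, small = 1.0;
  // Left-to-right order: each 'small' is absorbed by 'big' and lost.
  const double sum1 = (big + small) + small;  // == 1.0e16
  // A different reduction order keeps the small terms:
  const double sum2 = big + (small + small);  // == 1.0000000000000002e16
  std::printf ("%.17g\n%.17g\n", sum1, sum2);
  return 0;
}
```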
Steps to reproduce
Following the instructions at:
one might be able to reproduce this behavior on 'white' or 'ride' by cloning the Trilinos GitHub repo, checking out the 'develop' branch, and then doing:
But given that this test looks to be randomly failing, it may be hard to reproduce this behavior locally.