
Belos_pseudo_stochastic_pcg_hb_[0,1]_MPI_4 tests failing due to max iterations limit seemingly randomly in the Trilinos-atdm-white-ride-cuda-debug build on 'white' #2920

Closed
bartlettroscoe opened this issue Jun 8, 2018 · 21 comments
Assignees
Labels
client: ATDM Any issue primarily impacting the ATDM project PA: Linear Solvers Issues that fall under the Trilinos Linear Solvers Product Area pkg: Belos type: bug The primary issue is a bug in Trilinos code or tests

Comments

@bartlettroscoe
Member

bartlettroscoe commented Jun 8, 2018

CC: @trilinos/belos, @fryeguy52, @srajama1 (Linear Solvers Product Lead)

Next Action Status

Tests disabled in the Trilinos-atdm-white-ride-cuda-debug build in commit cc7fff2, pushed 6/12/2018; they showed as disabled and missing on CDash on 6/13/2018. PR #3546, merged 10/2/2018, re-enabled the tests, which should have been fixed by the earlier PR #3050. No new failures as of 12/3/2018!

Description

As shown in this rather complex query, which lists all failing Belos tests other than Belos_rcg_hb_MPI_4 in all promoted ATDM builds since 5/10/2018, the tests:

  • Belos_pseudo_stochastic_pcg_hb_0_MPI_4
  • Belos_pseudo_stochastic_pcg_hb_1_MPI_4

failed 5 times in total and appear to be failing randomly in the Trilinos-atdm-white-ride-cuda-debug build. (The other failing test shown, Belos_pseudo_pcg_hb_1_MPI_4, also in the Trilinos-atdm-white-ride-cuda-debug build, failed only once yesterday, so we will ignore it for now. The test Belos_rcg_hb_MPI_4 was excluded from the above query because it is addressed in #2919.)

Looking at the testing history for the tests Belos_pseudo_stochastic_pcg_hb_[0,1]_MPI_4 from 5/10/2018 through today, 6/8/2018, in this less complex query, one can see that these tests complete in about the same time, under 2 seconds, whether they pass or fail.

The output when these tests fail (such as shown for the test Belos_pseudo_stochastic_pcg_hb_1_MPI_4 yesterday on 6/7/2018 here) looks like:

Belos::StatusTestGeneralOutput: Passed
  (Num calls,Mod test,State test): (104, 1, Passed)
   Passed.......OR Combination -> 
     Failed.......Number of Iterations = 100 == 100
     Unconverged..(2-Norm Imp Res Vec) / (2-Norm Res0)
                  residual [ 0 ] = 8.95881e-09 < 1e-08
                  residual [ 1 ] = 1.21989e-08 > 1e-08
                  residual [ 2 ] = 6.84374e-09 < 1e-08
                  residual [ 3 ] = 9.15804e-09 < 1e-08
                  residual [ 4 ] = 7.2567e-09 < 1e-08

Passed.......OR Combination -> 
  Failed.......Number of Iterations = 100 == 100
  Unconverged..(2-Norm Imp Res Vec) / (2-Norm Res0)
               residual [ 0 ] = 8.95881e-09 < 1e-08
               residual [ 1 ] = 1.21989e-08 > 1e-08
               residual [ 2 ] = 6.84374e-09 < 1e-08
               residual [ 3 ] = 9.15804e-09 < 1e-08
               residual [ 4 ] = 7.2567e-09 < 1e-08

==================================================================================================================================

                                              TimeMonitor results over 4 processors

Timer Name                                               MinOverProcs     MeanOverProcs    MaxOverProcs     MeanOverCallCounts    
----------------------------------------------------------------------------------------------------------------------------------
Belos: Operation Op*x                                    0.06571 (101)    0.07122 (101)    0.07694 (101)    0.0007051 (101)       
Belos: Operation Prec*x                                  0.1014 (104)     0.108 (104)      0.1151 (104)     0.001039 (104)        
Belos: PseudoBlockStochasticCGSolMgr total solve time    0.2159 (1)       0.216 (1)        0.2162 (1)       0.216 (1)             
Epetra_CrsMatrix::Multiply(TransA,X,Y)                   0.0665 (102)     0.07206 (102)    0.07777 (102)    0.0007065 (102)       
Epetra_CrsMatrix::Solve(Upper,Trans,UnitDiag,X,Y)        0.101 (210)      0.1076 (210)     0.1147 (210)     0.0005122 (210)       
==================================================================================================================================
---------- Actual Residuals (normalized) ----------

Problem 0 : 	8.95881e-09
Problem 1 : 	1.21989e-08
Problem 2 : 	6.84374e-09
Problem 3 : 	9.15804e-09
Problem 4 : 	7.2567e-09

End Result: TEST FAILED

So this shows that the test fails because the max iteration limit of 100 is reached before the desired residual tolerance. The other failures for the tests Belos_pseudo_stochastic_pcg_hb_0_MPI_4 and Belos_pseudo_stochastic_pcg_hb_1_MPI_4 all appear to max out at 100 iterations as well.
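
For context, the "OR Combination" in the log above is Belos's composite status test: the iteration stops as soon as either the max-iteration test or the residual-norm test passes. Below is a minimal sketch of how such a combination is typically composed (an illustration assuming Epetra types and the tolerances shown in the log; it is not the actual test source):

#include "BelosStatusTestMaxIters.hpp"
#include "BelosStatusTestGenResNorm.hpp"
#include "BelosStatusTestCombo.hpp"
#include "Teuchos_RCP.hpp"
#include "Epetra_MultiVector.h"
#include "Epetra_Operator.h"

typedef double ST;
typedef Epetra_MultiVector MV;
typedef Epetra_Operator OP;

// Fails the solve once 100 iterations are reached
// ("Failed.......Number of Iterations = 100 == 100")
Teuchos::RCP<Belos::StatusTestMaxIters<ST,MV,OP> > maxIters =
  Teuchos::rcp(new Belos::StatusTestMaxIters<ST,MV,OP>(100));

// Passes once every (2-Norm Imp Res Vec) / (2-Norm Res0) drops below 1e-8
Teuchos::RCP<Belos::StatusTestGenResNorm<ST,MV,OP> > resNorm =
  Teuchos::rcp(new Belos::StatusTestGenResNorm<ST,MV,OP>(1.0e-8));

// OR combination: whichever sub-test passes first ends the solve
Teuchos::RCP<Belos::StatusTestCombo<ST,MV,OP> > orCombo =
  Teuchos::rcp(new Belos::StatusTestCombo<ST,MV,OP>(
    Belos::StatusTestCombo<ST,MV,OP>::OR, maxIters, resNorm));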

When the test Belos_pseudo_stochastic_pcg_hb_1_MPI_4 passed the day before, on 6/6/2018 (as shown here), the output looked like:

Belos::StatusTestGeneralOutput: Passed
  (Num calls,Mod test,State test): (89, 1, Passed)
   Passed.......OR Combination -> 
     OK...........Number of Iterations = 87 < 100
     Converged....(2-Norm Imp Res Vec) / (2-Norm Res0)
                  residual [ 0 ] = 5.02551e-09 < 1e-08
                  residual [ 1 ] = 5.92159e-09 < 1e-08
                  residual [ 2 ] = 6.61897e-09 < 1e-08
                  residual [ 3 ] = 8.2598e-09 < 1e-08
                  residual [ 4 ] = 3.67011e-09 < 1e-08

Passed.......OR Combination -> 
  OK...........Number of Iterations = 87 < 100
  Converged....(2-Norm Imp Res Vec) / (2-Norm Res0)
               residual [ 0 ] = 5.02551e-09 < 1e-08
               residual [ 1 ] = 5.92159e-09 < 1e-08
               residual [ 2 ] = 6.61897e-09 < 1e-08
               residual [ 3 ] = 8.2598e-09 < 1e-08
               residual [ 4 ] = 3.67011e-09 < 1e-08

=================================================================================================================================

                                              TimeMonitor results over 4 processors

Timer Name                                               MinOverProcs     MeanOverProcs    MaxOverProcs    MeanOverCallCounts    
---------------------------------------------------------------------------------------------------------------------------------
Belos: Operation Op*x                                    0.0652 (88)      0.06892 (88)     0.07251 (88)    0.0007831 (88)        
Belos: Operation Prec*x                                  0.09675 (89)     0.1009 (89)      0.1101 (89)     0.001134 (89)         
Belos: PseudoBlockStochasticCGSolMgr total solve time    0.195 (1)        0.195 (1)        0.195 (1)       0.195 (1)             
Epetra_CrsMatrix::Multiply(TransA,X,Y)                   0.06596 (89)     0.06969 (89)     0.07333 (89)    0.0007831 (89)        
Epetra_CrsMatrix::Solve(Upper,Trans,UnitDiag,X,Y)        0.09635 (180)    0.1006 (180)     0.1098 (180)    0.0005587 (180)       
=================================================================================================================================
---------- Actual Residuals (normalized) ----------

Problem 0 : 	5.02551e-09
Problem 1 : 	5.92159e-09
Problem 2 : 	6.61897e-09
Problem 3 : 	8.2598e-09
Problem 4 : 	3.67011e-09

End Result: TEST PASSED

which shows it converged in 87 iterations. I looked at several other instances where these tests passed, and they all converge in 87 iterations.

Is this non-deterministic behavior due to the fact that this is "stochastic" code and therefore the behavior is truly random? Is it because the random seed is not set consistently? Or is it due to non-deterministic accumulations in the CUDA 8.0 threaded Kokkos implementation on this machine? The fact that the test converges in 87 iterations whenever it passes suggests that this is not purposeful random behavior but the result of some other undesired and unintended randomness.

Steps to reproduce

Following the instructions at:

one might be able to reproduce this behavior on 'white' or 'ride' by cloning the Trilinos GitHub repo, checking out the 'develop' branch, and then doing:

$ cd <some_build_dir>/

$ source $TRILINOS_DIR/cmake/std/atdm/load-env.sh cuda-debug

$ cmake \
  -GNinja \
  -DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
  -DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_Belos=ON \
  $TRILINOS_DIR

$ make NP=16 

$ bsub -x -Is -q rhel7F -n 16 ctest -j16

But given that this test appears to fail randomly, it may be hard to reproduce this behavior locally.

@bartlettroscoe bartlettroscoe added type: bug The primary issue is a bug in Trilinos code or tests pkg: Belos client: ATDM Any issue primarily impacting the ATDM project labels Jun 8, 2018
@bartlettroscoe
Member Author

Given that the ATDM APP codes are unlikely to be using a (symmetric) pseudo-block stochastic CG solver and this is breaking already cleaned-up and promoted builds on 'white' and 'ride', I think it would be reasonable to disable these tests for at least the CUDA builds on 'white' and 'ride'. (We can't have randomly failing tests in promoted builds, since there will soon be automated processes based on these builds.)

But if this is an important solver for some important customers, then someone should take the time to debug what is happening here, in case this is a defect in the solver code and not just a defect in the test (like the one causing the random test failures in #2473).

If we do disable these tests for the CUDA builds on 'white' and 'ride', we will mark this issue with "Disabled Tests" and "Stalled" and leave it open as a reminder to address the problems.

@mhoemmen
Contributor

mhoemmen commented Jun 8, 2018

@bartlettroscoe The class documentation says "THIS CODE IS CURRENTLY EXPERIMENTAL. CAVEAT EMPTOR." I'd say, please feel free to disable the tests.

@srajama1
Contributor

srajama1 commented Jun 8, 2018

@hkthorn : Can you comment on disabling (and fixing later) these tests on CUDA?

bartlettroscoe added a commit to bartlettroscoe/Trilinos that referenced this issue Jun 12, 2018
…PI_4 for cuda-debug build on white/ride (trilinos#2920)

This test randomly fails in this build. See trilinos#2920.  This is also experimental
code so ATDM customers should not be using this solver hopefully.
@bartlettroscoe
Member Author

These tests were disabled for the build Trilinos-atdm-white-ride-cuda-debug in commit cc7fff2 pushed on 6/12/2018. You can see that these tests were disabled on CDash on 6/13/2018 for this build at:

which shows:

-- Setting default Belos_pseudo_stochastic_pcg_hb_0_MPI_4_DISABLE=ON
-- Setting default Belos_pseudo_stochastic_pcg_hb_1_MPI_4_DISABLE=ON
...
-- Belos_pseudo_stochastic_pcg_hb_0_MPI_4: NOT added test because Belos_pseudo_stochastic_pcg_hb_0_MPI_4_DISABLE='ON'!
-- Belos_pseudo_stochastic_pcg_hb_1_MPI_4: NOT added test because Belos_pseudo_stochastic_pcg_hb_1_MPI_4_DISABLE='ON'!

and those tests are missing from:

(NOTE: These tests are not shown as "missing" since the 'bsub' command crashed the day before and therefore CDash does not note this.)

But these tests are being enabled and run for other builds as shown, for example, in:

which shows:

-- Belos_pseudo_stochastic_pcg_hb_0_MPI_4: Added test (BASIC, NUM_MPI_PROCS=4, PROCESSORS=4)!
-- Belos_pseudo_stochastic_pcg_hb_1_MPI_4: Added test (BASIC, NUM_MPI_PROCS=4, PROCESSORS=4)!

@bartlettroscoe bartlettroscoe added Disabled Tests Issue has been partially addressed by disabling *all* of the failing tests related to the issue Stalled Issue may have been worked some but is not completed and/or is otherwise stalled for some reason labels Jun 14, 2018
@bartlettroscoe
Member Author

As shown in this updated query, the test Belos_pseudo_stochastic_pcg_hb_1_MPI_4 failed in the Trilinos-atdm-white-ride-gnu-debug-openmp build on 'white' today. The test output in that build today shows failure output similar to what was seen before:

Belos::StatusTestGeneralOutput: Passed
  (Num calls,Mod test,State test): (104, 1, Passed)
   Passed.......OR Combination -> 
     Failed.......Number of Iterations = 100 == 100
     Unconverged..(2-Norm Imp Res Vec) / (2-Norm Res0)
                  residual [ 0 ] = 8.95881e-09 < 1e-08
                  residual [ 1 ] = 1.21989e-08 > 1e-08
                  residual [ 2 ] = 6.84374e-09 < 1e-08
                  residual [ 3 ] = 9.15804e-09 < 1e-08
                  residual [ 4 ] = 7.2567e-09 < 1e-08

Passed.......OR Combination -> 
  Failed.......Number of Iterations = 100 == 100
  Unconverged..(2-Norm Imp Res Vec) / (2-Norm Res0)
               residual [ 0 ] = 8.95881e-09 < 1e-08
               residual [ 1 ] = 1.21989e-08 > 1e-08
               residual [ 2 ] = 6.84374e-09 < 1e-08
               residual [ 3 ] = 9.15804e-09 < 1e-08
               residual [ 4 ] = 7.2567e-09 < 1e-08

==================================================================================================================================

                                              TimeMonitor results over 4 processors

Timer Name                                               MinOverProcs     MeanOverProcs    MaxOverProcs     MeanOverCallCounts    
----------------------------------------------------------------------------------------------------------------------------------
Belos: Operation Op*x                                    0.03653 (101)    0.03966 (101)    0.04296 (101)    0.0003927 (101)       
Belos: Operation Prec*x                                  0.09903 (104)    0.1038 (104)     0.1128 (104)     0.0009986 (104)       
Belos: PseudoBlockStochasticCGSolMgr total solve time    0.1724 (1)       0.1725 (1)       0.1725 (1)       0.1725 (1)            
Epetra_CrsMatrix::Multiply(TransA,X,Y)                   0.03693 (102)    0.04006 (102)    0.04337 (102)    0.0003928 (102)       
Epetra_CrsMatrix::Solve(Upper,Trans,UnitDiag,X,Y)        0.09723 (210)    0.102 (210)      0.111 (210)      0.0004859 (210)       
==================================================================================================================================
---------- Actual Residuals (normalized) ----------

Problem 0 : 	8.95881e-09
Problem 1 : 	1.21989e-08
Problem 2 : 	6.84374e-09
Problem 3 : 	9.15804e-09
Problem 4 : 	7.2567e-09

End Result: TEST FAILED

See the max iterations reached:

     Failed.......Number of Iterations = 100 == 100

The previous day, the output for that same test in that same build, shown here, was:

Belos::StatusTestGeneralOutput: Passed
  (Num calls,Mod test,State test): (87, 1, Passed)
   Passed.......OR Combination -> 
     OK...........Number of Iterations = 86 < 100
     Converged....(2-Norm Imp Res Vec) / (2-Norm Res0)
                  residual [ 0 ] = 1.23203e-08 > 1e-08
                  residual [ 1 ] = 1.42423e-08 > 1e-08
                  residual [ 2 ] = 1.68138e-08 > 1e-08
                  residual [ 3 ] = 8.2598e-09 < 1e-08
                  residual [ 4 ] = 1.00805e-08 > 1e-08


Belos::StatusTestGeneralOutput: Passed
  (Num calls,Mod test,State test): (89, 1, Passed)
   Passed.......OR Combination -> 
     OK...........Number of Iterations = 87 < 100
     Converged....(2-Norm Imp Res Vec) / (2-Norm Res0)
                  residual [ 0 ] = 5.02551e-09 < 1e-08
                  residual [ 1 ] = 5.92159e-09 < 1e-08
                  residual [ 2 ] = 6.61897e-09 < 1e-08
                  residual [ 3 ] = 8.2598e-09 < 1e-08
                  residual [ 4 ] = 3.67011e-09 < 1e-08

Passed.......OR Combination -> 
  OK...........Number of Iterations = 87 < 100
  Converged....(2-Norm Imp Res Vec) / (2-Norm Res0)
               residual [ 0 ] = 5.02551e-09 < 1e-08
               residual [ 1 ] = 5.92159e-09 < 1e-08
               residual [ 2 ] = 6.61897e-09 < 1e-08
               residual [ 3 ] = 8.2598e-09 < 1e-08
               residual [ 4 ] = 3.67011e-09 < 1e-08

=================================================================================================================================

                                              TimeMonitor results over 4 processors

Timer Name                                               MinOverProcs     MeanOverProcs    MaxOverProcs    MeanOverCallCounts    
---------------------------------------------------------------------------------------------------------------------------------
Belos: Operation Op*x                                    0.0377 (88)      0.03982 (88)     0.04181 (88)    0.0004525 (88)        
Belos: Operation Prec*x                                  0.09774 (89)     0.106 (89)       0.1138 (89)     0.001191 (89)         
Belos: PseudoBlockStochasticCGSolMgr total solve time    0.1785 (1)       0.1785 (1)       0.1786 (1)      0.1785 (1)            
Epetra_CrsMatrix::Multiply(TransA,X,Y)                   0.03806 (89)     0.04024 (89)     0.04228 (89)    0.0004521 (89)        
Epetra_CrsMatrix::Solve(Upper,Trans,UnitDiag,X,Y)        0.09607 (180)    0.1042 (180)     0.112 (180)     0.000579 (180)        
=================================================================================================================================
---------- Actual Residuals (normalized) ----------

Problem 0 : 	5.02551e-09
Problem 1 : 	5.92159e-09
Problem 2 : 	6.61897e-09
Problem 3 : 	8.2598e-09
Problem 4 : 	3.67011e-09

End Result: TEST PASSED

See the number of iterations:

  OK...........Number of Iterations = 87 < 100

So this test in the build Trilinos-atdm-white-ride-gnu-debug-openmp is showing the same random failures as seen in the Trilinos-atdm-white-ride-cuda-debug build above.

Therefore, we need to disable these tests in the build Trilinos-atdm-white-ride-gnu-debug-openmp as well.

@mhoemmen
Contributor

I looked at the code -- it actually uses a random number, but on host, not on device.

@bartlettroscoe
Member Author

I looked at the code -- it actually uses a random number, but on host, not on device.

@mhoemmen, is it not given the same seed so that each run of the test is the same? Or is the behavior of the code in this test truly random (which would make it a problematic test in general)?

@bartlettroscoe
Member Author

As shown in this updated query, the tests Belos_pseudo_stochastic_pcg_hb_[0,1]_MPI_4 are also failing randomly in the build Trilinos-atdm-white-ride-cuda-opt, hitting the 100-iteration max when they fail (for example, as seen here and here). Therefore these tests need to be disabled for that build as well.

bartlettroscoe added a commit to bartlettroscoe/Trilinos that referenced this issue Jun 18, 2018
…PI_4 in gnu-debug-openmp build on white/ride (trilinos#2920)

This failed with maxing out at 100 iterations today in the
Trilinos-atdm-white-ride-gnu-debug-openmp build where otherwise it converges
at 87 iterations.  Therefore, we are disabling this test like we did for the
cuda-debug build.  For more details see trilinos#2920.
bartlettroscoe added a commit to bartlettroscoe/Trilinos that referenced this issue Jun 18, 2018
…PI_4 in cuda-opt build on white/ride (trilinos#2920)

This failed with maxing out at 100 iterations on 6/17/2018 in the
Trilinos-atdm-white-ride-cuda-opt build where otherwise it converges at 87
iterations.  Therefore, we are disabling this test like we did for some other
builds on white/ride.  For more details see trilinos#2920.
@mhoemmen
Contributor

mhoemmen commented Jun 18, 2018

@bartlettroscoe It uses rand, which doesn't promise the same behavior on different compilers, hardware, etc. A poor choice of random coefficient could affect convergence.

We should actually fix the solver by using one of the C++11 random number generators that promises cross-platform identical behavior, given the same seed. I'm pretty sure this would be OK even for Windows Epetra-only MueLu builds, since the main issue with Visual Studio is its incomplete implementation of SFINAE.
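
A minimal sketch of that suggestion (an illustration, not the actual solver code): seed a C++11 engine explicitly so every platform draws the same sequence.

#include <random>

// std::mt19937's output sequence is fully specified by the C++ standard, so a
// fixed seed gives identical values on every compiler/platform (unlike rand()).
std::mt19937 gen(42);

// Caveat: the standard does NOT pin down the distribution implementations, so
// for strict cross-platform reproducibility either consume the raw engine
// output or provide your own engine-to-double transform.
std::uniform_real_distribution<double> dist(-1.0, 1.0);
double coeff = dist(gen);  // hypothetical random coefficient in the stochastic CG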

@hkthorn
Contributor

hkthorn commented Jul 2, 2018

@bartlettroscoe @mhoemmen Given that the single-vector CG solvers are just barely failing to converge on these test platforms (#2920 and #3007), I am increasing the maximum number of iterations a little. There is nothing wrong with the solvers. This is not just due to the stochastic nature of this solver; the preconditioned CG solver has the same issue (#3007). That is because randomization is still used to generate the right-hand side of the linear system being solved.

I'm going ahead with the modification to the CMakeLists.txt so we can close this issue and get the tests back up and running on white/ride.

@srajama1
Contributor

srajama1 commented Jul 2, 2018

@hkthorn Thanks for taking care of this! I am curious: does a random RHS have any benefit over fixing the answer x (say, to all ones) and then computing A*x to arrive at the RHS? That way we would work around this issue altogether.
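
A minimal sketch of that proposal in Epetra (hypothetical variable names; assumes A is an assembled Epetra_CrsMatrix):

#include "Epetra_CrsMatrix.h"
#include "Epetra_Vector.h"

Epetra_Vector x(A.OperatorDomainMap());
Epetra_Vector b(A.OperatorRangeMap());
x.PutScalar(1.0);         // fix the exact solution to all ones
A.Multiply(false, x, b);  // b = A*x, so the expected answer is known a priori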

@hkthorn
Contributor

hkthorn commented Jul 2, 2018

@srajama1 Fixing the answer to a constant value could be a bad choice for test matrices; it depends on the eigenvalues/eigenvectors. I don't want to do that. I have toyed with the idea of just creating an RHS vector for each matrix that we use, so we know the expected behavior of the solver and are not subject to platform-dependent randomization.

hkthorn added a commit to hkthorn/Trilinos that referenced this issue Jul 2, 2018
…os#2920 and trilinos#3007

The random right hand side vector used to perform these tests can sometimes require
a few more iterations of preconditioned CG than normal.  Normal is about 87 and the
maximum iterations is set at 100.  Some random right hand sides require 104 iterations.
It is not an issue with the solver, but the linear system being solved.  Thus, this
commit increases the maximum iterations to 110 to give the preconditioned CG solver
more iterations if the right hand side vector is not as easy to solve for.
@srajama1
Contributor

srajama1 commented Jul 2, 2018

@hkthorn : I thought this was for one matrix. If there are multiple, then yes, I meant one answer for each. We can commit the RHS vector in another file, if needed.

@bartlettroscoe
Member Author

We can commit the rhs vector in another file, if needed.

That would eliminate the differing behavior across platforms. Tests that behave differently on different platforms because of the random number generator are generally not desirable.
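
A minimal sketch of reading a committed RHS with EpetraExt (the file name is hypothetical; this is not what the test currently does):

#include "EpetraExt_VectorIn.h"
#include "Epetra_CrsMatrix.h"
#include "Epetra_Vector.h"

// Load a fixed, version-controlled right-hand side instead of generating a
// random one, so every platform solves exactly the same linear system.
Epetra_Vector* b = NULL;
int err = EpetraExt::MatrixMarketFileToVector("pcg_test_rhs.mm",
                                              A.OperatorRangeMap(), b);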

hkthorn added a commit that referenced this issue Jul 10, 2018
Increase maximum number of iterations to address issues #2920 and #3007
hkthorn added a commit to hkthorn/Trilinos that referenced this issue Oct 2, 2018
On July 2, 2018, the CMakeLists.txt was modified to increase the maximum number
of iterations for the preconditioned CG tests that were failing.  This modification
was made for both standard PCG and stochastic PCG.  This change appears to still
enable these tests to pass on white/ride.  Thus, I am removing those tests from
the environment scripts that disable them.
@bartlettroscoe bartlettroscoe added the stage: in progress Work on the issue has started label Oct 2, 2018
@hkthorn hkthorn removed the Stalled Issue may have been worked some but is not completed and/or is otherwise stalled for some reason label Oct 2, 2018
trilinos-autotester added a commit that referenced this issue Oct 2, 2018
Automatically Merged using Trilinos Pull Request AutoTester
PR Title: Re-enable Belos PCG tests (#2920 & #3007) on white/ride
PR Author: hkthorn
@hkthorn
Contributor

hkthorn commented Oct 2, 2018

Closing this issue. The tests have been re-enabled.

@hkthorn
Contributor

hkthorn commented Oct 2, 2018

Seriously, I'm closing this issue.

@hkthorn hkthorn closed this as completed Oct 2, 2018
@bartlettroscoe bartlettroscoe added stage: in review Primary work is completed and now is just waiting for human review and/or test feedback and removed stage: in progress Work on the issue has started Disabled Tests Issue has been partially addressed by disabling *all* of the failing tests related to the issue labels Oct 2, 2018
@bartlettroscoe
Member Author

Reopening until we have confirmation on CDash. I have removed the Disabled Tests label and have updated the "Next Action Status" field to say we will watch this until 12/2/2018 and will close if we don't see any more random failures of these tests by that time.

NOTE: If these were not randomly failing tests and they had failed every day before, then we could look at the CDash results tomorrow, comment, and close this issue if the tests passed. But since these were random failures, we have to wait much, much longer to get confirmation.

Make sense?

@bartlettroscoe
Member Author

Reopening like I said I would :-)

@bartlettroscoe bartlettroscoe reopened this Oct 2, 2018
@hkthorn
Contributor

hkthorn commented Oct 2, 2018

Alrighty, I expect that you will close this bug then.

@bartlettroscoe
Member Author

Alrighty, I expect that you will close this bug then.

@hkthorn, for now, yes. But in the future, someone responsible for the Trilinos Linear Solvers Product Area will do it as part of a better structured process.

searhein pushed a commit to searhein/Trilinos that referenced this issue Oct 3, 2018
…GDSW-VFEM-Coarse-Spaces

* 'develop' of https://github.com/trilinos/Trilinos: (186 commits)
  Re-enable Belos PCG tests (trilinos#2920 & trilinos#3007) on white/ride
  ...
@bartlettroscoe bartlettroscoe added the PA: Linear Solvers Issues that fall under the Trilinos Linear Solvers Product Area label Nov 29, 2018
@fryeguy52
Contributor

fryeguy52 commented Dec 3, 2018

As per the Next Action Status, these tests have not failed and it is now 12/3/2018, so I am marking this issue as closed.

Here are links to the tests' histories:
Belos_pseudo_stochastic_pcg_hb_0_MPI_4
Belos_pseudo_stochastic_pcg_hb_1_MPI_4

@bartlettroscoe bartlettroscoe removed the stage: in review Primary work is completed and now is just waiting for human review and/or test feedback label Dec 3, 2018
tjfulle pushed a commit to tjfulle/Trilinos that referenced this issue Dec 6, 2018
On July 2, 2018, the CMakeLists.txt was modified to increase the maximum number
of iterations for the preconditioned CG tests that were failing.  This modification
was made for both standard PCG and stochastic PCG.  This change appears to still
enable these tests to pass on white/ride.  Thus, I am removing those tests from
the environment scripts that disable them.