NOX: Treat Exception as Solve Failure #1608

Closed
jmgate opened this issue Aug 15, 2017 · 45 comments
Assignees
Labels: pkg: LOCA, pkg: NOX, pkg: Stratimikos, pkg: Thyra, type: enhancement

Comments

@jmgate
Contributor

jmgate commented Aug 15, 2017

Charon has run into an issue where in the midst of a LOCA continuation run a preconditioner winds up throwing an exception. It would be ideal for this instance to be treated as a solve failure such that LOCA could back up, decrease the step size, and keep on going. We may be able to accomplish this by adding some exception handling logic to NOX::Thyra::Group::updateLOWS(), and then modify that routine to return something that will eventually indicate a solve failure.
@trilinos/nox

@jmgate jmgate added type: enhancement Issue is an enhancement, not a bug pkg: LOCA LOCA inside of NOX package pkg: NOX labels Aug 15, 2017
@jmgate jmgate self-assigned this Aug 15, 2017
@jmgate
Contributor Author

jmgate commented Aug 15, 2017

Email from @jmgate to @etphipp:

Hey Eric,

Suzey Gao has a Charon test case she’s trying to run with a LOCA sweep. There’s a block LDU preconditioner operator getting built because her example uses a current constraint on some terminal of the device. In the midst of creating the preconditioner, it realizes the Schur complement is singular and throws an exception. Apparently at that point Charon quits and shows you the exception instead of cutting down the LOCA step size and trying again. Suzey’s current workaround is to restart the simulation using the solution from the last successful step as an initial guess, and then using the initial step size again (as opposed to the max step size). Do we need to be doing something such that we catch and handle this exception that’s getting thrown such that LOCA can cut the step size down and then keep on chugging?

Many thanks,

Jason

@jmgate
Contributor Author

jmgate commented Aug 15, 2017

Response:

Hi Jason,

So who is throwing the exception? Is it the preconditioning package (Teko, I guess)? The natural thing to do here is have LOCA catch the exception and treat it as a failed nonlinear solve step (in which case it would automatically reduce the step size and try again). However there is no logic in LOCA currently to do that. I would have to look at the code a little bit to see how easy/hard that would be to do in LOCA.

-Eric

@jmgate
Contributor Author

jmgate commented Aug 15, 2017

Response:

Teko is throwing the exception:

Teko: "rebuildInverse" could not construct the inverse operator using "Teko::AutoClone<charon::Schur2x2PreconditionerFactory, charon::Schur2x2PreconditionerFactory>"
*** THROWN EXCEPTION ***
/home/xngao/Program/Trilinos/packages/thyra/core/src/support/operator_solve/client_support/Thyra_DefaultSerialDenseLinearOpWithSolve_def.hpp:198:
Throw number = 2
Throw test that evaluated to true: (dim) != (rank)
Error, (dim = 1) != (rank = 0)!

Then Charon is catching that and rethrowing with a more useful message:

Teko: "rebuildInverse" could not construct the inverse operator using "Teko::AutoClone<charon::Schur2x2PreconditionerFactory, charon::Schur2x2PreconditionerFactory>"
*** THROWN EXCEPTION ***
/home/xngao/Program/Trilinos/tcad-charon/src/solver/Charon_Schur2x2PreconditionerFactory.cpp:164:
Throw number = 3
Throw test that evaluated to true: true
Schur2x2PreconditionerFactory::buildPreconditionerOperator(): I'm afraid it looks like S is not invertible.

Any other information you need?

Jason
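
The catch-and-rethrow step described above can be done without discarding the original low-level message. Here is a minimal sketch using standard C++ nested exceptions. The names `buildInverse`, `buildPreconditionerOperator`, and `describe` are hypothetical stand-ins for illustration, not Charon or Teko code.

```cpp
#include <exception>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the low-level factorization that fails.
void buildInverse() {
  throw std::runtime_error("Error, (dim = 1) != (rank = 0)!");
}

// Hypothetical stand-in for the application-level preconditioner factory.
// It adds context while keeping the original exception attached.
void buildPreconditionerOperator() {
  try {
    buildInverse();
  } catch (...) {
    std::throw_with_nested(std::runtime_error(
        "buildPreconditionerOperator(): S is not invertible"));
  }
}

// Flatten a chain of nested exceptions into one message, outermost first.
std::string describe(const std::exception& e) {
  std::string msg = e.what();
  try {
    std::rethrow_if_nested(e);  // no-op when nothing is nested
  } catch (const std::exception& inner) {
    msg += " <- " + describe(inner);
  } catch (...) {
    msg += " <- (unknown exception)";
  }
  return msg;
}
```

A caller that catches `const std::exception&` around `buildPreconditionerOperator()` can pass it to `describe` to recover both the added context and the original size-check message.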

@jmgate
Contributor Author

jmgate commented Aug 15, 2017

Response:

Well, you could try adding this patch to LOCA:

diff --git a/packages/nox/src-loca/src/LOCA_Stepper.C b/packages/nox/src-loca/src/LOCA_Stepper.C
index d31052626d..24301d2d7c 100644
--- a/packages/nox/src-loca/src/LOCA_Stepper.C
+++ b/packages/nox/src-loca/src/LOCA_Stepper.C
@@ -588,7 +588,13 @@ LOCA::Stepper::compute(LOCA::Abstract::Iterator::StepStatus stepStatus)
     printStartStep();
 
   // Compute next point on continuation curve
-  solverStatus = solverPtr->solve();
+  try {
+    solverStatus = solverPtr->solve();
+  }
+  catch(...) {
+    // Treat any un-caught exception as a solver failure
+    solverStatus = NOX::StatusTest::Failed;
+  }
 
   // Check solver status
   if (solverStatus == NOX::StatusTest::Failed) {

It adds a try { ... } block around the solve and catches any previously un-caught exception, treating it as a solver failure. That's a bit extreme, because some exceptions you might want to pass along, although I have no idea how you would determine that. It's also not clear to me if LOCA should be doing this, or if this is something that should be done inside NOX.

Roger, do you have any thoughts on that?

-Eric

@jmgate
Contributor Author

jmgate commented Aug 15, 2017

From @rppawlo:

I’m surprised that any steps are successful. Are you able to take multiple steps past the initial step before this failure occurs? The failure looks like a size check in Teko objects:

Throw test that evaluated to true: (dim) != (rank)

We should really dig into Teko and see what is happening here. It's possible that the LOCA augmented system is being seen by Teko as a block system and it is trying to invert the blocks.

My preference is to fix the preconditioner. If Teko can’t invert the blocks then that is an exceptional case. I don’t think LOCA or NOX should try to recover from this at the status test level.

Roger

@jmgate
Contributor Author

jmgate commented Aug 15, 2017

Response:

In Jason's original email it sounded like this was happening somewhere in the middle of a continuation run. I am guessing from the exception message that Teko has a 1x1 block that is zero, or zero to some tolerance in a rank calculation.

-Eric

I don't think it is possible that Teko is getting an augmented system created by LOCA, since LOCA can't create augmented systems in the form Teko would accept. It may be getting an augmented system created by Charon for the current constraint.

-Eric

@jmgate
Contributor Author

jmgate commented Aug 15, 2017

Response:

Yup, this is in the middle of a continuation run from 0 to 30 with an initial step size of 0.01. LOCA ramps up to a step size of 1 and gets as far as 10.5 for the continuation parameter before we run into this problem.

Specifically, Charon’s using a block LDU preconditioner in cases with a current constraint. If A = {{F, U}, {L, G}} is the blocked system, it computes S = G - L F^{-1} U, but then it tries to invert S when it’s singular. I was catching the exception thrown by Teko and rethrowing with a more useful error message so I could know what was going on. Any ideas as to what I should do instead? I suppose at the very least I could catch the Teko exception and return the identity as the preconditioner and we could just see what happens. Problem is this particular test case runs for a good eight hours before the problem manifests, so just trying things to see what works is very time consuming.

Thanks for the help,

Jason
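
For reference, the block LDU factorization behind the preconditioner described above can be written out as follows. The symbols F, U, L, G, and S are the ones from the comment. The explicit factorization and inverse are standard linear algebra, not taken from the Charon source.

```latex
\[
A =
\begin{pmatrix} F & U \\ L & G \end{pmatrix}
=
\begin{pmatrix} I & 0 \\ L F^{-1} & I \end{pmatrix}
\begin{pmatrix} F & 0 \\ 0 & S \end{pmatrix}
\begin{pmatrix} I & F^{-1} U \\ 0 & I \end{pmatrix},
\qquad
S = G - L F^{-1} U .
\]
\[
A^{-1} =
\begin{pmatrix} I & -F^{-1} U \\ 0 & I \end{pmatrix}
\begin{pmatrix} F^{-1} & 0 \\ 0 & S^{-1} \end{pmatrix}
\begin{pmatrix} I & 0 \\ -L F^{-1} & I \end{pmatrix}.
\]
```

Applying the inverse needs both F^{-1} and S^{-1}. If the constraint block is 1x1, as the `dim = 1` in the error message suggests, S is a scalar and a singular S simply means G = L F^{-1} U to within the rank tolerance.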

@jmgate
Contributor Author

jmgate commented Aug 15, 2017

At this point I'm waiting on Suzey Gao to return from vacation so I can get my hands on her actual example. Once I have it I can start experimenting with catching the exception in NOX::Thyra::Group::updateLOWS() and figuring out how we can use that to indicate a solve failure. I'll submit a pull request once I have something working so we can decide if it's the right way to go.

@rppawlo
Contributor

rppawlo commented Aug 15, 2017

So after our discussion yesterday with the Teko team, I think what we really need is to add a generic exception in Thyra or Stratimikos that all preconditioners can throw if they encounter a catastrophic failure. Then we could put a catch for this particular exception into NOX or LOCA that allows LOCA to cut the continuation step and restart the solver. The reason I would like a specific exception is that we need to differentiate this from all other exceptions, where the entire simulation should truly terminate (we don't want to waste compute cycles running garbage for other exceptions). Does this sound reasonable @jmgate @eric-c-cyr @etphipp @egphill ?
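
The behavior proposed here can be sketched in a few lines of standalone C++. This is only an illustration under stated assumptions: `RecoverableSolveFailure`, `SweepResult`, and `sweep` are hypothetical names, and the loop is a toy continuation driver rather than LOCA's actual stepper logic.

```cpp
#include <functional>
#include <stdexcept>

// Hypothetical "recoverable" failure, e.g. a preconditioner that cannot be built.
struct RecoverableSolveFailure : std::runtime_error {
  using std::runtime_error::runtime_error;
};

struct SweepResult {
  double finalParam;  // last continuation parameter value reached
  int failedSteps;    // number of steps that were cut and retried
};

// Toy continuation loop: advance the parameter from `start` to `end`.
// `solve(p)` is a hypothetical callback performing the nonlinear solve at p.
SweepResult sweep(double start, double end, double step, double minStep,
                  const std::function<void(double)>& solve) {
  double param = start;
  int failed = 0;
  while (param < end) {
    const double trial = (param + step < end) ? param + step : end;
    try {
      solve(trial);
      param = trial;  // step accepted
    } catch (const RecoverableSolveFailure&) {
      // Only this exception type is treated as a failed step.
      ++failed;
      step *= 0.5;
      if (step < minStep) throw;  // step size exhausted: give up
    }
    // Any other exception type propagates and terminates the sweep.
  }
  return {param, failed};
}
```

The design choice is that the catch clause names a single type. A `std::logic_error` or any other exception thrown from `solve` is not intercepted, so genuine bugs still bring the run down instead of being retried with a smaller step.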

@jmgate
Contributor Author

jmgate commented Aug 15, 2017

Yes, that sounds good to me.

@jmgate jmgate removed their assignment Aug 15, 2017
@jmgate
Contributor Author

jmgate commented Aug 16, 2017

@rppawlo, do you have a guess as to how long it might take to build in this generic exception to Thyra/Stratimikos? Just need to figure out if I tell Charon to wait a week or give everyone a patch to tide them over.

@jmgate
Contributor Author

jmgate commented Sep 5, 2017

Is adding this generic exception to @trilinos/thyra or @trilinos/stratimikos something I should take on? I don't really have any experience contributing to either package.

@bartlettroscoe
Member

Is the issue that a preconditioner is applied as Thyra::LinearOpBase::apply()? That function has no way of failing (it is assumed to always pass). If so, you would need an exception called something like Thyra::LinearOpApplyFailed, which would mean that Thyra::LinearOpBase::apply() really should have been able to compute the application of the linear operator but could not for some reason (e.g. max num iterations exceeded). I think you want a specific name like this so you don't accidentally catch some other exception that should bring the program down.

It seems like that might not be too hard to add. But the problem is that various solver packages would need to be upgraded to respond to that exception in a logical way. Even if you only upgrade LOCA to respond to this, it would still provide value.

@jmgate
Contributor Author

jmgate commented Nov 6, 2017

@rppawlo, any ETA on a preconditioner catastrophic failure exception in Thyra or Stratimikos that NOX or LOCA can then catch?

@bartlettroscoe
Member

All that needs to be done on the Thyra side is to define the exception class Thyra::LinearOpApplyFailed and then document it in the Thyra::LinearOpBase::apply() function documentation. Then subclasses of Thyra::LinearOpBase need to throw an exception of that type and clients of Thyra::LinearOpBase need to catch exceptions of that type.

@jmgate, can you take a stab at updating the Thyra_OperatorVectorTypes.hpp file to add that new exception type (see other examples there) and then update the Thyra_LinearOpBase_decl.hpp file to document that exception class in the documentation for the function Thyra::LinearOpBase::apply()? Then I can review it.
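
The three pieces described above (define the exception, document it on apply(), have subclasses throw it) might look roughly like the standalone sketch below. It deliberately lives in a made-up `sketch` namespace: the base class `std::runtime_error`, the `DiagonalInverseOp` class, and its `apply` signature are assumptions for illustration and do not reproduce the real Thyra headers.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

namespace sketch {
namespace Exceptions {

/// Thrown when apply() should have been able to compute its result
/// but could not (assumed base class; the real one may differ).
class LinearOpApplyFailed : public std::runtime_error {
 public:
  explicit LinearOpApplyFailed(const std::string& what_arg)
      : std::runtime_error(what_arg) {}
};

}  // namespace Exceptions

/// Hypothetical operator computing y[i] = x[i] / diag[i].
class DiagonalInverseOp {
 public:
  explicit DiagonalInverseOp(std::vector<double> diag)
      : diag_(std::move(diag)) {}

  /// \throws std::invalid_argument if x has the wrong size (a caller bug).
  /// \throws Exceptions::LinearOpApplyFailed if a diagonal entry is zero,
  ///         i.e. the application could not be computed.
  std::vector<double> apply(const std::vector<double>& x) const {
    if (x.size() != diag_.size())
      throw std::invalid_argument("apply(): size mismatch");
    std::vector<double> y(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
      if (diag_[i] == 0.0)
        throw Exceptions::LinearOpApplyFailed("apply(): zero diagonal entry");
      y[i] = x[i] / diag_[i];
    }
    return y;
  }

 private:
  std::vector<double> diag_;
};

}  // namespace sketch
```

Keeping the size check on a different exception type mirrors the point made earlier in the thread: a client that catches only `LinearOpApplyFailed` will not accidentally swallow a programming error.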

@jmgate
Contributor Author

jmgate commented Nov 6, 2017

Yup, I'll give it a shot.

@jmgate jmgate self-assigned this Nov 6, 2017
@jmgate jmgate added pkg: Thyra Issues primarily dealing with the Thyra Package pkg: Stratimikos stage: in progress Work on the issue has started labels Nov 6, 2017
@jmgate
Contributor Author

jmgate commented Nov 9, 2017

Added the exception—working on documentation…

jmgate added a commit to jmgate/Trilinos that referenced this issue Nov 21, 2017
Added the LinearOpApplyFailed exception.
jmgate added a commit to jmgate/Trilinos that referenced this issue Nov 21, 2017
Added documentation of Thyra::Exceptions::LinearOpApplyFailed to
LinearOpBase::apply().
jmgate added a commit to jmgate/Trilinos that referenced this issue Nov 21, 2017
Modified LOCA_Stepper to catch a Thyra::Exceptions::LinearOpApplyFailed
and treat that as a solver failure.
@bartlettroscoe
Member

We really need the stack trace to see what is going on. Can you reproduce from a restart?

If you have a working version of the BinUtils library, you should be able to turn on stack tracing so when that final exception is caught, you can see the stack trace. See:

@jmgate
Contributor Author

jmgate commented Feb 2, 2018

This problem may not actually exist. I was able to run the example in question and the entire continuation run completed successfully without throwing any exceptions. I've asked @suzeygao to do a clean build pointing to the same commits I'm looking at to see if we can reproduce the problem. Sorry to have bothered you all.

@jmgate
Contributor Author

jmgate commented Mar 2, 2018

Sorry it's taken me so long to get back to this. I was able to reproduce the problem Suzey's seeing, and was able to reproduce it from a restart so we don't have to wait 30 hours per run to debug. I obtained the stack traces below by running

$ gdb --args ../../driver/charon_mp.exe --i=input.xml --current
(gdb) set pagination off
(gdb) catch throw
(gdb) commands
>backtrace
>continue
>end
(gdb) run

This spits out a stacktrace any time throw is called. The ones below are the ones that appear at the end of the run before Charon quits.

Stacktraces:

Catchpoint 1 (exception thrown), __cxxabiv1::__cxa_throw (obj=0x2acab130, tinfo=0xa053e68 , dest=0x732f0e2 ) at ../../../../../src/gcc-7.3.0/libstdc++-v3/libsupc++/eh_throw.cc:75
75      {
#0  __cxxabiv1::__cxa_throw (obj=0x2acab130, tinfo=0xa053e68 , dest=0x732f0e2 ) at ../../../../../src/gcc-7.3.0/libstdc++-v3/libsupc++/eh_throw.cc:75
#1  0x00000000083fe166 in Thyra::AmesosLinearOpWithSolveFactory::initializeOp (this=0x28a25fa0, fwdOpSrc=..., Op=0x29f62e10, supportSolveUse=Thyra::SUPPORT_SOLVE_UNSPECIFIED) at /workspace/Trilinos/packages/stratimikos/adapters/amesos/src/Thyra_AmesosLinearOpWithSolveFactory.cpp:344
#2  0x0000000007e908fb in Teko::SolveInverseFactory::rebuildInverse (this=0x28a23ee0, source=..., dest=...) at /workspace/Trilinos/packages/teko/src/Teko_SolveInverseFactory.cpp:152
#3  0x0000000007e4e5a6 in Teko::rebuildInverse (factory=..., A=..., invA=...) at /workspace/Trilinos/packages/teko/src/Teko_InverseFactory.cpp:181
#4  0x0000000005876503 in charon::Schur2x2PreconditionerFactory::buildPreconditionerOperator (this=0x292f1c28, op=..., state=...) at /workspace/Trilinos/tcad-charon/src/solver/Charon_Schur2x2PreconditionerFactory.cpp:134
#5  0x0000000007e4bdf0 in Teko::BlockPreconditionerFactory::buildPreconditionerOperator (this=0x292f1c28, lo=..., state=...) at /workspace/Trilinos/packages/teko/src/Teko_BlockPreconditionerFactory.cpp:68
#6  0x0000000007e66cc8 in Teko::PreconditionerFactory::initializePrec (this=0x292f1c28, ASrc=..., prec=0x2acaac00, supportSolveUse=Thyra::SUPPORT_SOLVE_UNSPECIFIED) at /workspace/Trilinos/packages/teko/src/Teko_PreconditionerFactory.cpp:117
#7  0x0000000007e84593 in Teko::PreconditionerInverseFactory::rebuildInverse (this=0x28a21440, source=..., dest=...) at /workspace/Trilinos/packages/teko/src/Teko_PreconditionerInverseFactory.cpp:241
#8  0x0000000007e4e5a6 in Teko::rebuildInverse (factory=..., A=..., invA=...) at /workspace/Trilinos/packages/teko/src/Teko_InverseFactory.cpp:181
#9  0x0000000007e95270 in Teko::StratimikosFactory::initializePrec_Thyra (this=0x228a53f0, fwdOpSrc=..., prec=0x2628cba0, supportSolveUse=Thyra::SUPPORT_SOLVE_UNSPECIFIED) at /workspace/Trilinos/packages/teko/src/Teko_StratimikosFactory.cpp:223
#10 0x0000000007e9486c in Teko::StratimikosFactory::initializePrec (this=0x228a53f0, fwdOpSrc=..., prec=0x2628cba0, supportSolveUse=Thyra::SUPPORT_SOLVE_UNSPECIFIED) at /workspace/Trilinos/packages/teko/src/Teko_StratimikosFactory.cpp:139
#11 0x00000000075956c7 in NOX::Thyra::Group::updateLOWS (this=0x25ebb330) at /workspace/Trilinos/packages/nox/src-thyra/NOX_Thyra_Group.C:884
#12 0x0000000007593f43 in NOX::Thyra::Group::applyJacobianInverseMultiVector (this=0x25ebb330, p=..., input=..., result=...) at /workspace/Trilinos/packages/nox/src-thyra/NOX_Thyra_Group.C:739
#13 0x0000000007593a02 in NOX::Thyra::Group::applyJacobianInverseMultiVector (this=0x25ebb330, p=..., input=..., result=...) at /workspace/Trilinos/packages/nox/src-thyra/NOX_Thyra_Group.C:643
#14 0x000000000746dc6c in LOCA::BorderedSolver::JacobianOperator::applyInverse (this=0x262983c0, params=..., B=..., X=...) at /workspace/Trilinos/packages/nox/src-loca/src/LOCA_BorderedSolver_JacobianOperator.C:94
#15 0x00000000074f0edc in LOCA::BorderedSolver::LowerTriangularBlockElimination::solve (this=0x7fffffff8fd0, params=..., op=..., B=..., C=..., F=0x262a0bf0, G=0x2629e340, X=..., Y=...) at /workspace/Trilinos/packages/nox/src-loca/src/LOCA_BorderedSolver_LowerTriangularBlockElimination.C:100
#16 0x000000000746edd1 in LOCA::BorderedSolver::Bordering::applyInverse (this=0x262a1820, params=..., F=0x262a0bf0, G=0x2629e340, X=..., Y=...) at /workspace/Trilinos/packages/nox/src-loca/src/LOCA_BorderedSolver_Bordering.C:208
#17 0x000000000741e854 in LOCA::MultiContinuation::ConstrainedGroup::applyJacobianInverseMultiVector (this=0x26296e10, params=..., input=..., result=...) at /workspace/Trilinos/packages/nox/src-loca/src/LOCA_MultiContinuation_ConstrainedGroup.C:674
#18 0x000000000741d8fc in LOCA::MultiContinuation::ConstrainedGroup::computeNewton (this=0x26296e10, params=...) at /workspace/Trilinos/packages/nox/src-loca/src/LOCA_MultiContinuation_ConstrainedGroup.C:481
#19 0x0000000007427fb2 in LOCA::MultiContinuation::ExtendedGroup::computeNewton (this=0x262925f8, params=...) at /workspace/Trilinos/packages/nox/src-loca/src/LOCA_MultiContinuation_ExtendedGroup.C:148
#20 0x00000000075db2b6 in NOX::Direction::Newton::compute (this=0x262b4630, dir=..., soln=..., solver=...) at /workspace/Trilinos/packages/nox/src/NOX_Direction_Newton.C:136
#21 0x00000000075f8e5e in NOX::Direction::Generic::compute (this=0x262b4630, d=..., g=..., s=...) at /workspace/Trilinos/packages/nox/src/NOX_Direction_Generic.C:59
#22 0x00000000075db684 in NOX::Direction::Newton::compute (this=0x262b4630, dir=..., soln=..., solver=...) at /workspace/Trilinos/packages/nox/src/NOX_Direction_Newton.C:162
#23 0x000000000752678c in NOX::Solver::LineSearchBased::step (this=0x262a1f70) at /workspace/Trilinos/packages/nox/src/NOX_Solver_LineSearchBased.C:175
#24 0x0000000007526b3f in NOX::Solver::LineSearchBased::solve (this=0x262a1f70) at /workspace/Trilinos/packages/nox/src/NOX_Solver_LineSearchBased.C:234
#25 0x000000000738415f in LOCA::Stepper::start (this=0x2628efd0) at /workspace/Trilinos/packages/nox/src-loca/src/LOCA_Stepper.C:372
#26 0x0000000007403f29 in LOCA::Abstract::Iterator::run (this=0x2628efd0) at /workspace/Trilinos/packages/nox/src-loca/src/LOCA_Abstract_Iterator.C:122
#27 0x0000000006955c74 in Piro::LOCASolver::evalModelImpl (this=0x25dc2aa0, inArgs=..., outArgs=...) at /workspace/Trilinos/packages/piro/src/Piro_LOCASolver_Def.hpp:197
#28 0x00000000047dc033 in Thyra::ModelEvaluatorDefaultBase::evalModel (this=0x25dc2c30, inArgs=..., outArgs=...) at /workspace/Trilinos/packages/thyra/core/src/support/nonlinear/model_evaluator/client_support/Thyra_ModelEvaluatorDefaultBase.hpp:685
#29 0x000000000463bfc9 in main (argc=3, argv=0x7fffffffc6b8) at /workspace/Trilinos/tcad-charon/driver/Charon_Main.cpp:775
 Teko: "rebuildInverse" could not construct the inverse operator using "Thyra::AmesosLinearOpWithSolveFactory{solverType=Klu}"

*** THROWN EXCEPTION ***
/workspace/Trilinos/packages/stratimikos/adapters/amesos/src/Thyra_AmesosLinearOpWithSolveFactory.cpp:346:

Throw number = 1

Throw test that evaluated to true: 0!=err

Error, NumericFactorization() on amesos solver of type 'Amesos_Klu'
returned error code -22!


Catchpoint 1 (exception thrown), __cxxabiv1::__cxa_throw (obj=0x2aa8aac0, tinfo=0xbd74320 <_ZTISt9exception@@GLIBCXX_3.4>, dest=0x4637e20 _ZNSt9exceptionD1Ev@plt) at ../../../../../src/gcc-7.3.0/libstdc++-v3/libsupc++/eh_throw.cc:75
75 {
#0 __cxxabiv1::__cxa_throw (obj=0x2aa8aac0, tinfo=0xbd74320 <_ZTISt9exception@@GLIBCXX_3.4>, dest=0x4637e20 _ZNSt9exceptionD1Ev@plt) at ../../../../../src/gcc-7.3.0/libstdc++-v3/libsupc++/eh_throw.cc:75
#1 0x0000000007e4e722 in Teko::rebuildInverse (factory=..., A=..., invA=...) at /workspace/Trilinos/packages/teko/src/Teko_InverseFactory.cpp:193
#2 0x0000000005876503 in charon::Schur2x2PreconditionerFactory::buildPreconditionerOperator (this=0x292f1c28, op=..., state=...) at /workspace/Trilinos/tcad-charon/src/solver/Charon_Schur2x2PreconditionerFactory.cpp:134
#3 0x0000000007e4bdf0 in Teko::BlockPreconditionerFactory::buildPreconditionerOperator (this=0x292f1c28, lo=..., state=...) at /workspace/Trilinos/packages/teko/src/Teko_BlockPreconditionerFactory.cpp:68
#4 0x0000000007e66cc8 in Teko::PreconditionerFactory::initializePrec (this=0x292f1c28, ASrc=..., prec=0x2acaac00, supportSolveUse=Thyra::SUPPORT_SOLVE_UNSPECIFIED) at /workspace/Trilinos/packages/teko/src/Teko_PreconditionerFactory.cpp:117
#5 0x0000000007e84593 in Teko::PreconditionerInverseFactory::rebuildInverse (this=0x28a21440, source=..., dest=...) at /workspace/Trilinos/packages/teko/src/Teko_PreconditionerInverseFactory.cpp:241
#6 0x0000000007e4e5a6 in Teko::rebuildInverse (factory=..., A=..., invA=...) at /workspace/Trilinos/packages/teko/src/Teko_InverseFactory.cpp:181
#7 0x0000000007e95270 in Teko::StratimikosFactory::initializePrec_Thyra (this=0x228a53f0, fwdOpSrc=..., prec=0x2628cba0, supportSolveUse=Thyra::SUPPORT_SOLVE_UNSPECIFIED) at /workspace/Trilinos/packages/teko/src/Teko_StratimikosFactory.cpp:223
#8 0x0000000007e9486c in Teko::StratimikosFactory::initializePrec (this=0x228a53f0, fwdOpSrc=..., prec=0x2628cba0, supportSolveUse=Thyra::SUPPORT_SOLVE_UNSPECIFIED) at /workspace/Trilinos/packages/teko/src/Teko_StratimikosFactory.cpp:139
#9 0x00000000075956c7 in NOX::Thyra::Group::updateLOWS (this=0x25ebb330) at /workspace/Trilinos/packages/nox/src-thyra/NOX_Thyra_Group.C:884
#10 0x0000000007593f43 in NOX::Thyra::Group::applyJacobianInverseMultiVector (this=0x25ebb330, p=..., input=..., result=...) at /workspace/Trilinos/packages/nox/src-thyra/NOX_Thyra_Group.C:739
#11 0x0000000007593a02 in NOX::Thyra::Group::applyJacobianInverseMultiVector (this=0x25ebb330, p=..., input=..., result=...) at /workspace/Trilinos/packages/nox/src-thyra/NOX_Thyra_Group.C:643
#12 0x000000000746dc6c in LOCA::BorderedSolver::JacobianOperator::applyInverse (this=0x262983c0, params=..., B=..., X=...) at /workspace/Trilinos/packages/nox/src-loca/src/LOCA_BorderedSolver_JacobianOperator.C:94
#13 0x00000000074f0edc in LOCA::BorderedSolver::LowerTriangularBlockElimination::solve (this=0x7fffffff8fd0, params=..., op=..., B=..., C=..., F=0x262a0bf0, G=0x2629e340, X=..., Y=...) at /workspace/Trilinos/packages/nox/src-loca/src/LOCA_BorderedSolver_LowerTriangularBlockElimination.C:100
#14 0x000000000746edd1 in LOCA::BorderedSolver::Bordering::applyInverse (this=0x262a1820, params=..., F=0x262a0bf0, G=0x2629e340, X=..., Y=...) at /workspace/Trilinos/packages/nox/src-loca/src/LOCA_BorderedSolver_Bordering.C:208
#15 0x000000000741e854 in LOCA::MultiContinuation::ConstrainedGroup::applyJacobianInverseMultiVector (this=0x26296e10, params=..., input=..., result=...) at /workspace/Trilinos/packages/nox/src-loca/src/LOCA_MultiContinuation_ConstrainedGroup.C:674
#16 0x000000000741d8fc in LOCA::MultiContinuation::ConstrainedGroup::computeNewton (this=0x26296e10, params=...) at /workspace/Trilinos/packages/nox/src-loca/src/LOCA_MultiContinuation_ConstrainedGroup.C:481
#17 0x0000000007427fb2 in LOCA::MultiContinuation::ExtendedGroup::computeNewton (this=0x262925f8, params=...) at /workspace/Trilinos/packages/nox/src-loca/src/LOCA_MultiContinuation_ExtendedGroup.C:148
#18 0x00000000075db2b6 in NOX::Direction::Newton::compute (this=0x262b4630, dir=..., soln=..., solver=...) at /workspace/Trilinos/packages/nox/src/NOX_Direction_Newton.C:136
#19 0x00000000075f8e5e in NOX::Direction::Generic::compute (this=0x262b4630, d=..., g=..., s=...) at /workspace/Trilinos/packages/nox/src/NOX_Direction_Generic.C:59
#20 0x00000000075db684 in NOX::Direction::Newton::compute (this=0x262b4630, dir=..., soln=..., solver=...) at /workspace/Trilinos/packages/nox/src/NOX_Direction_Newton.C:162
#21 0x000000000752678c in NOX::Solver::LineSearchBased::step (this=0x262a1f70) at /workspace/Trilinos/packages/nox/src/NOX_Solver_LineSearchBased.C:175
#22 0x0000000007526b3f in NOX::Solver::LineSearchBased::solve (this=0x262a1f70) at /workspace/Trilinos/packages/nox/src/NOX_Solver_LineSearchBased.C:234
#23 0x000000000738415f in LOCA::Stepper::start (this=0x2628efd0) at /workspace/Trilinos/packages/nox/src-loca/src/LOCA_Stepper.C:372
#24 0x0000000007403f29 in LOCA::Abstract::Iterator::run (this=0x2628efd0) at /workspace/Trilinos/packages/nox/src-loca/src/LOCA_Abstract_Iterator.C:122
#25 0x0000000006955c74 in Piro::LOCASolver::evalModelImpl (this=0x25dc2aa0, inArgs=..., outArgs=...) at /workspace/Trilinos/packages/piro/src/Piro_LOCASolver_Def.hpp:197
#26 0x00000000047dc033 in Thyra::ModelEvaluatorDefaultBase::evalModel (this=0x25dc2c30, inArgs=..., outArgs=...) at /workspace/Trilinos/packages/thyra/core/src/support/nonlinear/model_evaluator/client_support/Thyra_ModelEvaluatorDefaultBase.hpp:685
#27 0x000000000463bfc9 in main (argc=3, argv=0x7fffffffc6b8) at /workspace/Trilinos/tcad-charon/driver/Charon_Main.cpp:775
Catchpoint 1 (exception thrown), __cxxabiv1::__cxa_throw (obj=0x2ac9b0b0, tinfo=0x98089a8 , dest=0x587c132 NOX::Exceptions::SolverFailure::~SolverFailure()) at ../../../../../src/gcc-7.3.0/libstdc++-v3/libsupc++/eh_throw.cc:75
75 {
#0 __cxxabiv1::__cxa_throw (obj=0x2ac9b0b0, tinfo=0x98089a8 , dest=0x587c132 NOX::Exceptions::SolverFailure::~SolverFailure()) at ../../../../../src/gcc-7.3.0/libstdc++-v3/libsupc++/eh_throw.cc:75
#1 0x0000000005877074 in charon::Schur2x2PreconditionerFactory::buildPreconditionerOperator (this=0x292f1c28, op=..., state=...) at /workspace/Trilinos/tcad-charon/src/solver/Charon_Schur2x2PreconditionerFactory.cpp:137
#2 0x0000000007e4bdf0 in Teko::BlockPreconditionerFactory::buildPreconditionerOperator (this=0x292f1c28, lo=..., state=...) at /workspace/Trilinos/packages/teko/src/Teko_BlockPreconditionerFactory.cpp:68
#3 0x0000000007e66cc8 in Teko::PreconditionerFactory::initializePrec (this=0x292f1c28, ASrc=..., prec=0x2acaac00, supportSolveUse=Thyra::SUPPORT_SOLVE_UNSPECIFIED) at /workspace/Trilinos/packages/teko/src/Teko_PreconditionerFactory.cpp:117
#4 0x0000000007e84593 in Teko::PreconditionerInverseFactory::rebuildInverse (this=0x28a21440, source=..., dest=...) at /workspace/Trilinos/packages/teko/src/Teko_PreconditionerInverseFactory.cpp:241
#5 0x0000000007e4e5a6 in Teko::rebuildInverse (factory=..., A=..., invA=...) at /workspace/Trilinos/packages/teko/src/Teko_InverseFactory.cpp:181
#6 0x0000000007e95270 in Teko::StratimikosFactory::initializePrec_Thyra (this=0x228a53f0, fwdOpSrc=..., prec=0x2628cba0, supportSolveUse=Thyra::SUPPORT_SOLVE_UNSPECIFIED) at /workspace/Trilinos/packages/teko/src/Teko_StratimikosFactory.cpp:223
#7 0x0000000007e9486c in Teko::StratimikosFactory::initializePrec (this=0x228a53f0, fwdOpSrc=..., prec=0x2628cba0, supportSolveUse=Thyra::SUPPORT_SOLVE_UNSPECIFIED) at /workspace/Trilinos/packages/teko/src/Teko_StratimikosFactory.cpp:139
#8 0x00000000075956c7 in NOX::Thyra::Group::updateLOWS (this=0x25ebb330) at /workspace/Trilinos/packages/nox/src-thyra/NOX_Thyra_Group.C:884
#9 0x0000000007593f43 in NOX::Thyra::Group::applyJacobianInverseMultiVector (this=0x25ebb330, p=..., input=..., result=...) at /workspace/Trilinos/packages/nox/src-thyra/NOX_Thyra_Group.C:739
#10 0x0000000007593a02 in NOX::Thyra::Group::applyJacobianInverseMultiVector (this=0x25ebb330, p=..., input=..., result=...) at /workspace/Trilinos/packages/nox/src-thyra/NOX_Thyra_Group.C:643
#11 0x000000000746dc6c in LOCA::BorderedSolver::JacobianOperator::applyInverse (this=0x262983c0, params=..., B=..., X=...) at /workspace/Trilinos/packages/nox/src-loca/src/LOCA_BorderedSolver_JacobianOperator.C:94
#12 0x00000000074f0edc in LOCA::BorderedSolver::LowerTriangularBlockElimination::solve (this=0x7fffffff8fd0, params=..., op=..., B=..., C=..., F=0x262a0bf0, G=0x2629e340, X=..., Y=...) at /workspace/Trilinos/packages/nox/src-loca/src/LOCA_BorderedSolver_LowerTriangularBlockElimination.C:100
#13 0x000000000746edd1 in LOCA::BorderedSolver::Bordering::applyInverse (this=0x262a1820, params=..., F=0x262a0bf0, G=0x2629e340, X=..., Y=...) at /workspace/Trilinos/packages/nox/src-loca/src/LOCA_BorderedSolver_Bordering.C:208
#14 0x000000000741e854 in LOCA::MultiContinuation::ConstrainedGroup::applyJacobianInverseMultiVector (this=0x26296e10, params=..., input=..., result=...) at /workspace/Trilinos/packages/nox/src-loca/src/LOCA_MultiContinuation_ConstrainedGroup.C:674
#15 0x000000000741d8fc in LOCA::MultiContinuation::ConstrainedGroup::computeNewton (this=0x26296e10, params=...) at /workspace/Trilinos/packages/nox/src-loca/src/LOCA_MultiContinuation_ConstrainedGroup.C:481
#16 0x0000000007427fb2 in LOCA::MultiContinuation::ExtendedGroup::computeNewton (this=0x262925f8, params=...) at /workspace/Trilinos/packages/nox/src-loca/src/LOCA_MultiContinuation_ExtendedGroup.C:148
#17 0x00000000075db2b6 in NOX::Direction::Newton::compute (this=0x262b4630, dir=..., soln=..., solver=...) at /workspace/Trilinos/packages/nox/src/NOX_Direction_Newton.C:136
#18 0x00000000075f8e5e in NOX::Direction::Generic::compute (this=0x262b4630, d=..., g=..., s=...) at /workspace/Trilinos/packages/nox/src/NOX_Direction_Generic.C:59
#19 0x00000000075db684 in NOX::Direction::Newton::compute (this=0x262b4630, dir=..., soln=..., solver=...) at /workspace/Trilinos/packages/nox/src/NOX_Direction_Newton.C:162
#20 0x000000000752678c in NOX::Solver::LineSearchBased::step (this=0x262a1f70) at /workspace/Trilinos/packages/nox/src/NOX_Solver_LineSearchBased.C:175
#21 0x0000000007526b3f in NOX::Solver::LineSearchBased::solve (this=0x262a1f70) at /workspace/Trilinos/packages/nox/src/NOX_Solver_LineSearchBased.C:234
#22 0x000000000738415f in LOCA::Stepper::start (this=0x2628efd0) at /workspace/Trilinos/packages/nox/src-loca/src/LOCA_Stepper.C:372
#23 0x0000000007403f29 in LOCA::Abstract::Iterator::run (this=0x2628efd0) at /workspace/Trilinos/packages/nox/src-loca/src/LOCA_Abstract_Iterator.C:122
#24 0x0000000006955c74 in Piro::LOCASolver::evalModelImpl (this=0x25dc2aa0, inArgs=..., outArgs=...) at /workspace/Trilinos/packages/piro/src/Piro_LOCASolver_Def.hpp:197
#25 0x00000000047dc033 in Thyra::ModelEvaluatorDefaultBase::evalModel (this=0x25dc2c30, inArgs=..., outArgs=...) at /workspace/Trilinos/packages/thyra/core/src/support/nonlinear/model_evaluator/client_support/Thyra_ModelEvaluatorDefaultBase.hpp:685
#26 0x000000000463bfc9 in main (argc=3, argv=0x7fffffffc6b8) at /workspace/Trilinos/tcad-charon/driver/Charon_Main.cpp:775
Teko: "rebuildInverse" could not construct the inverse operator using "Teko::AutoClone<charon::Schur2x2PreconditionerFactory, charon::Schur2x2PreconditionerFactory>"

*** THROWN EXCEPTION ***
/workspace/Trilinos/tcad-charon/src/solver/Charon_Schur2x2PreconditionerFactory.cpp:139:

Throw number = 2

Throw test that evaluated to true: true

Schur2x2PreconditionerFactory::buildPreconditionerOperator(): I'm afraid it looks like F is not invertible.


Catchpoint 1 (exception thrown), __cxxabiv1::__cxa_throw (obj=0x2aca3ab0, tinfo=0xbd74320 <_ZTISt9exception@@GLIBCXX_3.4>, dest=0x4637e20 _ZNSt9exceptionD1Ev@plt) at ../../../../../src/gcc-7.3.0/libstdc++-v3/libsupc++/eh_throw.cc:75
75 {
#0 __cxxabiv1::__cxa_throw (obj=0x2aca3ab0, tinfo=0xbd74320 <_ZTISt9exception@@GLIBCXX_3.4>, dest=0x4637e20 _ZNSt9exceptionD1Ev@plt) at ../../../../../src/gcc-7.3.0/libstdc++-v3/libsupc++/eh_throw.cc:75
#1 0x0000000007e4e722 in Teko::rebuildInverse (factory=..., A=..., invA=...) at /workspace/Trilinos/packages/teko/src/Teko_InverseFactory.cpp:193
#2 0x0000000007e95270 in Teko::StratimikosFactory::initializePrec_Thyra (this=0x228a53f0, fwdOpSrc=..., prec=0x2628cba0, supportSolveUse=Thyra::SUPPORT_SOLVE_UNSPECIFIED) at /workspace/Trilinos/packages/teko/src/Teko_StratimikosFactory.cpp:223
#3 0x0000000007e9486c in Teko::StratimikosFactory::initializePrec (this=0x228a53f0, fwdOpSrc=..., prec=0x2628cba0, supportSolveUse=Thyra::SUPPORT_SOLVE_UNSPECIFIED) at /workspace/Trilinos/packages/teko/src/Teko_StratimikosFactory.cpp:139
#4 0x00000000075956c7 in NOX::Thyra::Group::updateLOWS (this=0x25ebb330) at /workspace/Trilinos/packages/nox/src-thyra/NOX_Thyra_Group.C:884
#5 0x0000000007593f43 in NOX::Thyra::Group::applyJacobianInverseMultiVector (this=0x25ebb330, p=..., input=..., result=...) at /workspace/Trilinos/packages/nox/src-thyra/NOX_Thyra_Group.C:739
#6 0x0000000007593a02 in NOX::Thyra::Group::applyJacobianInverseMultiVector (this=0x25ebb330, p=..., input=..., result=...) at /workspace/Trilinos/packages/nox/src-thyra/NOX_Thyra_Group.C:643
#7 0x000000000746dc6c in LOCA::BorderedSolver::JacobianOperator::applyInverse (this=0x262983c0, params=..., B=..., X=...) at /workspace/Trilinos/packages/nox/src-loca/src/LOCA_BorderedSolver_JacobianOperator.C:94
#8 0x00000000074f0edc in LOCA::BorderedSolver::LowerTriangularBlockElimination::solve (this=0x7fffffff8fd0, params=..., op=..., B=..., C=..., F=0x262a0bf0, G=0x2629e340, X=..., Y=...) at /workspace/Trilinos/packages/nox/src-loca/src/LOCA_BorderedSolver_LowerTriangularBlockElimination.C:100
#9 0x000000000746edd1 in LOCA::BorderedSolver::Bordering::applyInverse (this=0x262a1820, params=..., F=0x262a0bf0, G=0x2629e340, X=..., Y=...) at /workspace/Trilinos/packages/nox/src-loca/src/LOCA_BorderedSolver_Bordering.C:208
#10 0x000000000741e854 in LOCA::MultiContinuation::ConstrainedGroup::applyJacobianInverseMultiVector (this=0x26296e10, params=..., input=..., result=...) at /workspace/Trilinos/packages/nox/src-loca/src/LOCA_MultiContinuation_ConstrainedGroup.C:674
#11 0x000000000741d8fc in LOCA::MultiContinuation::ConstrainedGroup::computeNewton (this=0x26296e10, params=...) at /workspace/Trilinos/packages/nox/src-loca/src/LOCA_MultiContinuation_ConstrainedGroup.C:481
#12 0x0000000007427fb2 in LOCA::MultiContinuation::ExtendedGroup::computeNewton (this=0x262925f8, params=...) at /workspace/Trilinos/packages/nox/src-loca/src/LOCA_MultiContinuation_ExtendedGroup.C:148
#13 0x00000000075db2b6 in NOX::Direction::Newton::compute (this=0x262b4630, dir=..., soln=..., solver=...) at /workspace/Trilinos/packages/nox/src/NOX_Direction_Newton.C:136
#14 0x00000000075f8e5e in NOX::Direction::Generic::compute (this=0x262b4630, d=..., g=..., s=...) at /workspace/Trilinos/packages/nox/src/NOX_Direction_Generic.C:59
#15 0x00000000075db684 in NOX::Direction::Newton::compute (this=0x262b4630, dir=..., soln=..., solver=...) at /workspace/Trilinos/packages/nox/src/NOX_Direction_Newton.C:162
#16 0x000000000752678c in NOX::Solver::LineSearchBased::step (this=0x262a1f70) at /workspace/Trilinos/packages/nox/src/NOX_Solver_LineSearchBased.C:175
#17 0x0000000007526b3f in NOX::Solver::LineSearchBased::solve (this=0x262a1f70) at /workspace/Trilinos/packages/nox/src/NOX_Solver_LineSearchBased.C:234
#18 0x000000000738415f in LOCA::Stepper::start (this=0x2628efd0) at /workspace/Trilinos/packages/nox/src-loca/src/LOCA_Stepper.C:372
#19 0x0000000007403f29 in LOCA::Abstract::Iterator::run (this=0x2628efd0) at /workspace/Trilinos/packages/nox/src-loca/src/LOCA_Abstract_Iterator.C:122
#20 0x0000000006955c74 in Piro::LOCASolver<double>::evalModelImpl (this=0x25dc2aa0, inArgs=..., outArgs=...) at /workspace/Trilinos/packages/piro/src/Piro_LOCASolver_Def.hpp:197
#21 0x00000000047dc033 in Thyra::ModelEvaluatorDefaultBase<double>::evalModel (this=0x25dc2c30, inArgs=..., outArgs=...) at /workspace/Trilinos/packages/thyra/core/src/support/nonlinear/model_evaluator/client_support/Thyra_ModelEvaluatorDefaultBase.hpp:685
#22 0x000000000463bfc9 in main (argc=3, argv=0x7fffffffc6b8) at /workspace/Trilinos/tcad-charon/driver/Charon_Main.cpp:775
SOLVE FAILURE: std::exception

MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode 14.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.

[Thread 0x7fffed50e700 (LWP 16865) exited]
[Thread 0x7ffff7fde880 (LWP 16855) exited]
[Inferior 1 (process 16855) exited with code 016]
Missing separate debuginfos, use: debuginfo-install cyrus-sasl-lib-2.1.26-21.el7.x86_64 glibc-2.17-196.el7_4.2.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.15.1-8.el7.x86_64 libcom_err-1.42.9-10.el7.x86_64 libcurl-7.29.0-42.el7_4.1.x86_64 libidn-1.28-4.el7.x86_64 libselinux-2.5-11.el7.x86_64 libssh2-1.4.3-10.el7_2.1.x86_64 nspr-4.13.1-1.0.el7_3.x86_64 nss-3.28.4-15.el7_4.x86_64 nss-softokn-freebl-3.28.3-8.el7_4.x86_64 nss-util-3.28.4-3.el7.x86_64 openldap-2.4.44-5.el7.x86_64 openssl-libs-1.0.2k-8.el7.x86_64 pcre-8.32-17.el7.x86_64
(gdb)


I'm afraid this leaves me puzzled. Where is this std::exception being thrown?

@rppawlo, @eric-c-cyr, @etphipp

@jmgate jmgate reopened this Mar 2, 2018
@jmgate
Contributor Author

jmgate commented Mar 7, 2018

I think I see what's going on here. It looks like there are two other occurrences of solverPtr->solve() in LOCA_Stepper.C that should perhaps be wrapped in the same sort of try/catch block that was our original solution to this problem. I'll try that out, and if it works I'll submit a PR against Trilinos.
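
As a rough illustration of what wrapping a solve call in a try/catch buys the stepper, here is a minimal sketch. It does not use the real LOCA::Stepper code: SolverFailure, StepStatus, guardedSolve, and runContinuation are hypothetical stand-ins, and halving the step after a failure is an assumption made for simplicity.

```cpp
#include <cassert>
#include <functional>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for NOX::Exceptions::SolverFailure.
struct SolverFailure : std::runtime_error {
  explicit SolverFailure(const std::string& msg) : std::runtime_error(msg) {}
};

enum class StepStatus { Successful, Unsuccessful };

// Run one nonlinear solve and report a thrown SolverFailure as an
// unsuccessful step instead of letting it unwind past the stepper.
StepStatus guardedSolve(const std::function<bool()>& solve) {
  try {
    return solve() ? StepStatus::Successful : StepStatus::Unsuccessful;
  } catch (const SolverFailure&) {
    return StepStatus::Unsuccessful;
  }
}

// Toy continuation loop: accept a step when the solve succeeds, otherwise
// halve the step size and retry. Stops after maxSteps accepted steps or once
// the step size drops below minStep. Returns the number of accepted steps.
int runContinuation(const std::function<bool(double)>& solveAtStep,
                    double step, double minStep, int maxSteps) {
  assert(minStep > 0.0 && step >= minStep);
  int accepted = 0;
  while (accepted < maxSteps && step >= minStep) {
    const StepStatus status = guardedSolve([&] { return solveAtStep(step); });
    if (status == StepStatus::Successful) {
      ++accepted;
    } else {
      step *= 0.5;  // assumed back-off rule, not LOCA's actual step control
    }
  }
  return accepted;
}
```

Every call site that invokes the solver needs the same guard; an unguarded call lets the exception escape exactly as before.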

@jmgate
Contributor Author

jmgate commented Mar 8, 2018

@etphipp, whenever you're in, could you give me a rundown of how this Stepper class works?

@etphipp
Contributor

etphipp commented Mar 8, 2018 via email

@jmgate
Contributor Author

jmgate commented Mar 14, 2018

Apparently I was running this case from a restart incorrectly. I fired off the restart run correctly, with what I'm hoping are the same parameters that would've existed at the end of the failed run, and this has been running fine for days now. I'm afraid I'm going to have to try the original run from scratch in gdb and see what happens, but that'll take another few days to get to the point of failure.

@jmgate
Contributor Author

jmgate commented Mar 19, 2018

I'm afraid I can't debug this case. Running gdb on eight cores, we get an MPI_ABORT almost immediately. Running gdb on a single core, the simulation never triggers the exception that is thrown when running on eight cores without gdb. Instead, LOCA starts the continuation run, eventually starts decreasing the step size until it hits the minimum step size, and then gives up. Without the ability to reproduce this exception in a debugger, I don't think there's anything we can do about it.

@jmgate jmgate closed this as completed Mar 19, 2018
@jmgate jmgate removed the stage: in progress Work on the issue has started label Mar 19, 2018
@jmgate jmgate reopened this Mar 29, 2018
@jmgate
Contributor Author

jmgate commented Mar 29, 2018

So I FINALLY was able to reproduce this problem in a debugger.

Catchpoint 1 (exception thrown), __cxxabiv1::__cxa_throw (obj=0x2957c920, tinfo=0xc4e0340 <_ZTISt9exception@@GLIBCXX_3.4>, dest=0x4a7a5c0 <_ZNSt9exceptionD1Ev@plt>) at ../../../../../src/gcc-7.3.0/libstdc++-v3/libsupc++/eh_throw.cc:75
75      {
#0  __cxxabiv1::__cxa_throw (obj=0x2957c920, tinfo=0xc4e0340 <_ZTISt9exception@@GLIBCXX_3.4>, dest=0x4a7a5c0 <_ZNSt9exceptionD1Ev@plt>) at ../../../../../src/gcc-7.3.0/libstdc++-v3/libsupc++/eh_throw.cc:75
#1  0x000000000838c7a4 in Teko::rebuildInverse (factory=..., A=..., invA=...) at /workspace/Trilinos/packages/teko/src/Teko_InverseFactory.cpp:193
#2  0x00000000083d32f2 in Teko::StratimikosFactory::initializePrec_Thyra (this=0x22ec4e00, fwdOpSrc=..., prec=0x25b8bb20, supportSolveUse=Thyra::SUPPORT_SOLVE_UNSPECIFIED) at /workspace/Trilinos/packages/teko/src/Teko_StratimikosFactory.cpp:223
#3  0x00000000083d28ee in Teko::StratimikosFactory::initializePrec (this=0x22ec4e00, fwdOpSrc=..., prec=0x25b8bb20, supportSolveUse=Thyra::SUPPORT_SOLVE_UNSPECIFIED) at /workspace/Trilinos/packages/teko/src/Teko_StratimikosFactory.cpp:139
#4  0x0000000007ad30aa in NOX::Thyra::Group::updateLOWS (this=0x25559df0) at /workspace/Trilinos/packages/nox/src-thyra/NOX_Thyra_Group.C:906
#5  0x0000000007ad17ed in NOX::Thyra::Group::applyJacobianInverseMultiVector (this=0x25559df0, p=..., input=..., result=...) at /workspace/Trilinos/packages/nox/src-thyra/NOX_Thyra_Group.C:752
#6  0x0000000007ad12ac in NOX::Thyra::Group::applyJacobianInverseMultiVector (this=0x25559df0, p=..., input=..., result=...) at /workspace/Trilinos/packages/nox/src-thyra/NOX_Thyra_Group.C:656
#7  0x00000000079aa668 in LOCA::BorderedSolver::JacobianOperator::applyInverse (this=0x29f0a450, params=..., B=..., X=...) at /workspace/Trilinos/packages/nox/src-loca/src/LOCA_BorderedSolver_JacobianOperator.C:94
#8  0x0000000007a2d8d8 in LOCA::BorderedSolver::LowerTriangularBlockElimination::solve (this=0x7fffffff9190, params=..., op=..., B=..., C=..., F=0x29f4d740, G=0x29f03e90, X=..., Y=...) at /workspace/Trilinos/packages/nox/src-loca/src/LOCA_BorderedSolver_LowerTriangularBlockElimination.C:100
#9  0x00000000079ab7cd in LOCA::BorderedSolver::Bordering::applyInverse (this=0x29f4e4c0, params=..., F=0x29f4d740, G=0x29f03e90, X=..., Y=...) at /workspace/Trilinos/packages/nox/src-loca/src/LOCA_BorderedSolver_Bordering.C:208
#10 0x000000000795b250 in LOCA::MultiContinuation::ConstrainedGroup::applyJacobianInverseMultiVector (this=0x29f089d0, params=..., input=..., result=...) at /workspace/Trilinos/packages/nox/src-loca/src/LOCA_MultiContinuation_ConstrainedGroup.C:674
#11 0x000000000795a2f8 in LOCA::MultiContinuation::ConstrainedGroup::computeNewton (this=0x29f089d0, params=...) at /workspace/Trilinos/packages/nox/src-loca/src/LOCA_MultiContinuation_ConstrainedGroup.C:481
#12 0x00000000079649ae in LOCA::MultiContinuation::ExtendedGroup::computeNewton (this=0x2ae3e088, params=...) at /workspace/Trilinos/packages/nox/src-loca/src/LOCA_MultiContinuation_ExtendedGroup.C:148
#13 0x0000000007b19330 in NOX::Direction::Newton::compute (this=0x25b941f0, dir=..., soln=..., solver=...) at /workspace/Trilinos/packages/nox/src/NOX_Direction_Newton.C:136
#14 0x0000000007b36ed8 in NOX::Direction::Generic::compute (this=0x25b941f0, d=..., g=..., s=...) at /workspace/Trilinos/packages/nox/src/NOX_Direction_Generic.C:59
#15 0x0000000007b196fe in NOX::Direction::Newton::compute (this=0x25b941f0, dir=..., soln=..., solver=...) at /workspace/Trilinos/packages/nox/src/NOX_Direction_Newton.C:162
#16 0x0000000007a63188 in NOX::Solver::LineSearchBased::step (this=0x2c402a60) at /workspace/Trilinos/packages/nox/src/NOX_Solver_LineSearchBased.C:175
#17 0x0000000007a6356f in NOX::Solver::LineSearchBased::solve (this=0x2c402a60) at /workspace/Trilinos/packages/nox/src/NOX_Solver_LineSearchBased.C:236
#18 0x00000000078c242a in LOCA::Stepper::compute (this=0x25b8dfb0, stepStatus=LOCA::Abstract::Iterator::Successful) at /workspace/Trilinos/packages/nox/src-loca/src/LOCA_Stepper.C:594
#19 0x0000000007940a29 in LOCA::Abstract::Iterator::iterate (this=0x25b8dfb0) at /workspace/Trilinos/packages/nox/src-loca/src/LOCA_Abstract_Iterator.C:150
#20 0x000000000794096a in LOCA::Abstract::Iterator::run (this=0x25b8dfb0) at /workspace/Trilinos/packages/nox/src-loca/src/LOCA_Abstract_Iterator.C:128
#21 0x0000000006e5472e in Piro::LOCASolver<double>::evalModelImpl (this=0x25461560, inArgs=..., outArgs=...) at /workspace/Trilinos/packages/piro/src/Piro_LOCASolver_Def.hpp:197
#22 0x0000000004c1e84b in Thyra::ModelEvaluatorDefaultBase<double>::evalModel (this=0x254616f0, inArgs=..., outArgs=...) at /workspace/Trilinos/packages/thyra/core/src/support/nonlinear/model_evaluator/client_support/Thyra_ModelEvaluatorDefaultBase.hpp:685
#23 0x0000000004a7e769 in main (argc=3, argv=0x7fffffffc6b8) at /workspace/Trilinos/tcad-charon/driver/Charon_Main.cpp:775

It looks like Teko::rebuildInverse() is throwing the std::exception. Since LOCA_Stepper.C only catches NOX::Exceptions::SolverFailure, this std::exception makes it through and winds up killing Charon. I could try the following:

diff --git a/packages/teko/src/Teko_StratimikosFactory.cpp b/packages/teko/src/Teko_StratimikosFactory.cpp
index 182819d..d3761c5 100644
--- a/packages/teko/src/Teko_StratimikosFactory.cpp
+++ b/packages/teko/src/Teko_StratimikosFactory.cpp
@@ -1,3 +1,5 @@
+#include "NOX_Exceptions.H"
+
 #include "Teko_StratimikosFactory.hpp"
 
 #include "Teuchos_Time.hpp"
@@ -220,7 +222,17 @@ void StratimikosFactory::initializePrec_Thyra(
      if(prec_Op==Teuchos::null)
         prec_Op = Teko::buildInverse(*invFactory_,fwdOp);
      else
-        Teko::rebuildInverse(*invFactory_,fwdOp,prec_Op);
+     {
+        try
+        {
+           Teko::rebuildInverse(*invFactory_,fwdOp,prec_Op);
+        }
+        catch (...)
+           TEUCHOS_TEST_FOR_EXCEPTION(true, NOX::Exceptions::SolverFailure,
+              "StratimikosFactory::initializePrec_Thyra():  I'm afraid "      \
+              "something went wrong in Teko::rebuildInverse().  Treating "    \
+              "this as a solver failure.")
+     }
   }
 
   // construct preconditioner

but that introduces a NOX dependency into Teko that isn't otherwise there. Alternatively, we could try the following in NOX:

diff --git a/packages/nox/src-thyra/NOX_Thyra_Group.C b/packages/nox/src-thyra/NOX_Thyra_Group.C
index 0fc4542..08225ec 100644
--- a/packages/nox/src-thyra/NOX_Thyra_Group.C
+++ b/packages/nox/src-thyra/NOX_Thyra_Group.C
@@ -68,6 +68,7 @@
 #include "NOX_Abstract_MultiVector.H"
 #include "NOX_Thyra_MultiVector.H"
 #include "NOX_Assert.H"
+#include "NOX_Exceptions.H"

 NOX::Thyra::Group::
 Group(const NOX::Thyra::Vector& initial_guess,
@@ -890,6 +890,7 @@ void NOX::Thyra::Group::updateLOWS() const
 
   this->scaleResidualAndJacobian();
 
+  try
   {
     NOX_FUNC_TIME_MONITOR("NOX Total Preconditioner Construction");
 
@@ -932,6 +933,10 @@ void NOX::Thyra::Group::updateLOWS() const
     }
 
   }
+  catch (...)
+    TEUCHOS_TEST_FOR_EXCEPTION(true, Exceptions::SolverFailure,
+      "NOX::Thyra::Group::updateLOWS():  I'm afraid something went wrong in " \
+      "creating the preconditioner.  Treating this as a solver failure.")
 
   this->unscaleResidualAndJacobian();

@etphipp, @eric-c-cyr, is that an acceptable solution?

Failing that, at this point I think my time would be better spent writing a Python wrapper around Charon that'll detect failures in the midst of a LOCA run and restart. Probably should've done that months ago.
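
To make the mechanism behind the second diff concrete, here is a small self-contained sketch. Nothing in it is Trilinos code: SolverFailure, Outcome, rebuildSingularPreconditioner, asSolverFailure, and runStep are hypothetical names, and the outer catch in runStep merely stands in for whatever top-level handler ends the application. It shows that a handler for one exception type lets an unrelated std::runtime_error pass straight through, and that translating at the boundary turns the same failure into something the stepper can react to.

```cpp
#include <functional>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical stand-in for NOX::Exceptions::SolverFailure.
struct SolverFailure : std::runtime_error {
  explicit SolverFailure(const std::string& msg) : std::runtime_error(msg) {}
};

enum class Outcome { Converged, StepRejected, Aborted };

// Stand-in for a preconditioner rebuild that hits a singular block.
void rebuildSingularPreconditioner() {
  throw std::runtime_error("F is not invertible");
}

// Wrap a setup routine so that anything it throws leaves as SolverFailure,
// mirroring a try/catch (...) placed around the preconditioner construction.
std::function<void()> asSolverFailure(std::function<void()> setup) {
  return [setup = std::move(setup)] {
    try {
      setup();
    } catch (...) {
      throw SolverFailure("preconditioner construction failed");
    }
  };
}

// Inner handler: what the stepper catches. Outer handler: stand-in for the
// application-level handler that would abort the run.
Outcome runStep(const std::function<void()>& solve) {
  try {
    try {
      solve();
      return Outcome::Converged;
    } catch (const SolverFailure&) {
      return Outcome::StepRejected;  // the stepper could cut the step size here
    }
  } catch (...) {
    return Outcome::Aborted;
  }
}
```

In this sketch, runStep(rebuildSingularPreconditioner) ends in Outcome::Aborted, while runStep(asSolverFailure(rebuildSingularPreconditioner)) ends in Outcome::StepRejected. The catch (...) is deliberately broad, which is the trade-off of this approach: it cannot tell a singular matrix from any other error.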

@etphipp
Contributor

etphipp commented Mar 29, 2018

The first approach I think you just can't do, because NOX (indirectly) depends on Teko and you can't have circular dependencies.

I can live with the second approach, but it seems far from ideal. What if a real failure happens that shouldn't be treated as just a solver failure? I think the best solution would be to have a set of exceptions that are independent of Thyra, NOX, Teko, ... that capture this case. Maybe such a thing could be in Teuchos?

@jmgate
Contributor Author

jmgate commented Mar 29, 2018

Seems reasonable. Who can we talk to in @trilinos/teuchos about this?

@bartlettroscoe
Member

@etphipp said:

I think the best solution would be to have a set of exceptions that are independent of Thyra, NOX, Teko, ... that capture this case. Maybe such a thing could be in Teuchos?

@jmgate said:

Seems reasonable. Who can we talk to in @trilinos/teuchos about this?

That seems reasonable. The question is which Teuchos subpackage would they go in? What is the set of exception classes being proposed?

@etphipp
Contributor

etphipp commented Mar 29, 2018

At this point I think we are talking about just one exception that represents a numerical solver failure (e.g., preconditioner applied to a singular matrix), although one could imagine others.

@bartlettroscoe
Member

I think the best option is to break out the solver interfaces currently in the TeuchosRemainder subpackage:

packages/teuchos/remainder/src/Trilinos_Details_LinearSolverFactory.cpp
packages/teuchos/remainder/src/Trilinos_Details_LinearSolverFactory.hpp
packages/teuchos/remainder/src/Trilinos_Details_LinearSolver.hpp

and create a new Teuchos subpackage TeuchosSolverInterfaces to move these interfaces into and then create the file:

packages/teuchos/solver_interfaces/src/Teuchos_SolverExceptions.hpp

This would kill two birds with one stone (i.e., moving these solver interfaces into a logical subpackage and providing a place for some generic solver exceptions).

We would then derive some of the exceptions in Thyra, NOX, and other packages from these exceptions.

@mhoemmen, what do you think about this idea?
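
For readers following along, the idea of deriving package exceptions from a shared base could be sketched as below. This is only an illustration under assumptions: the namespaces common, precond, and nonlinear and every class name in them are hypothetical placeholders, not the contents of any existing or proposed Teuchos header.

```cpp
#include <functional>
#include <stdexcept>
#include <string>

namespace common {  // hypothetical package-neutral layer
// Base class for recoverable numerical solver failures.
struct SolverFailure : std::runtime_error {
  explicit SolverFailure(const std::string& msg) : std::runtime_error(msg) {}
};
}  // namespace common

namespace precond {  // hypothetical preconditioner package
struct SingularOperator : common::SolverFailure {
  explicit SingularOperator(const std::string& msg)
      : common::SolverFailure(msg) {}
};
}  // namespace precond

namespace nonlinear {  // hypothetical nonlinear solver package
struct SolveFailed : common::SolverFailure {
  explicit SolveFailed(const std::string& msg) : common::SolverFailure(msg) {}
};
}  // namespace nonlinear

enum class Action { Continue, CutStepSize };

// The stepper catches the shared base by reference, so it needs no knowledge
// of the package that threw. Exceptions outside the hierarchy are not caught
// here and keep propagating to the caller.
Action attemptStep(const std::function<void()>& solve) {
  try {
    solve();
    return Action::Continue;
  } catch (const common::SolverFailure&) {
    return Action::CutStepSize;
  }
}
```

With this layout the preconditioner package depends only on the common layer, so no circular dependency arises, and a std::logic_error or std::bad_alloc is still reported as a real failure rather than being folded into a step-size reduction.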

@jmgate
Contributor Author

jmgate commented Nov 27, 2018

Ping. Charon still has an open issue waiting on the resolution of this one.

What exactly is the procedure to follow for an application to request something from Trilinos and have it actually happen?

@mhoemmen
Contributor

@bartlettroscoe I didn't see that earlier message of yours. I'm OK with the plan that you proposed, to move the solver interface stuff in TeuchosRemainder into a new subpackage, TeuchosSolverInterfaces.

@jmgate wrote:

What exactly is the procedure to follow for an application to request something from Trilinos and have it actually happen?

Best practice currently is to have a Trilinos developer on your team.

mhoemmen pushed a commit to mhoemmen/Trilinos that referenced this issue Dec 3, 2018
@trilinos/teuchos @trilinos/nox @trilinos/stratimikos @trilinos/thyra

@jmgate requested that I add an exception class to Teuchos, so that NOX
and Stratimikos / Thyra can share a common exception class for reporting
failure in setup of a linear solver.  I consulted with @jmgate on 03 Dec
2018 to scope the work.  This commit adds the requested class,
Trilinos::LinearSolverSetupFailure, as well as a unit test.
mhoemmen pushed a commit to mhoemmen/Trilinos that referenced this issue Dec 3, 2018

This commit fixes trilinos#3983, and works towards resolving trilinos#1608.
@mhoemmen
Contributor

mhoemmen commented Dec 3, 2018

PR #3983 adds the new exception class.

mhoemmen added a commit that referenced this issue Dec 4, 2018
tjfulle pushed a commit to tjfulle/Trilinos that referenced this issue Dec 6, 2018

This commit fixes trilinos#3983, and works towards resolving trilinos#1608.
@jmgate jmgate closed this as completed Nov 20, 2019