New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
NOX: Treat Exception as Solve Failure #1608
Comments
Email from @jmgate to @etphipp:
|
Response:
|
Response:
|
Response:
|
From @rppawlo:
|
Response:
|
Response:
|
At this point I'm waiting on Suzey Gao to return from vacation so I can get my hands on her actual example. Once I have it I can start experimenting with catching the exception in |
So after our discussion yesterday with Teko team, I think what we really need is to add a generic exception in Thyra or Stratimikos that all preconditioners can use to throw if they encounter a catastrophic failure. Then we could put into the NOX or LOCA a catch on this particular exception that allows LOCA to cut the continuation step and restart the solver. The reason I would like a specific exception is that we need to differentiate this from all other exceptions where the entire simulation should truly terminate (don't want to waste compute cycles running garbage for other exceptions). Does this sound reasonable @jmgate @eric-c-cyr @etphipp @egphill ? |
Yes, that sounds good to me. |
@rppawlo, do you have a guess as to how long it might take to build in this generic exception to Thyra/Stratimikos? Just need to figure out if I tell Charon to wait a week or give everyone a patch to tide them over. |
Is adding this generic exception to @trilinos/thyra or @trilinos/stratimikos something I should take on? I don't really have any experience contributing to either package. |
Is the issue that a preconditioner is applied as Thyra::LinearOpBase::apply()? That function has no way of failing (it is assumed to always pass). So would you need an exception called something like Thyra::LinearOpApplyFailed which would mean that Thyra::LinearOpBase::apply() really should have been able to compute the application of the linear operator but could not for some reason (e.g. max num iterations exceeded). I think you want a specific name like this so you don't accidentally catch some other exception that should bring the program down. It seems like that might not be too hard to add. But the problem is that various solver packages would need to be upgraded to to respond to that exception in a logical way. But if you only upgrading LOCA to respond to this then this would provide value. |
@rppawlo, any ETA on a preconditioner catastrophic failure exception in Thyra or Stratimikos that NOX or LOCA can then catch? |
All that needs to be done on the Thyra side is to define the exception class Thyra::LinearOpApplyFailed and then document it in the Thyra::LinearOpBase::apply() function documentation. Then subclasses of Thyra::LinearOpBase need to throw an exception of that type and clients of Thyra::LinearOpBase need to catch exceptions of that type. @jmgate, can you take a stab at updating the Thyra_OperatorVectorTypes.hpp file to add that new exception type (see other examples there) and then update the file Thyra_LinearOpBase_decl.hpp file to document that exception class in the documentation for the functioin Thyra::LinearOpBase::apply()? Then I can review it. |
Yup, I'll give it a shot. |
Added the exception—working on documentation… |
Added the LinearOpApplyFailed exception.
Added documentation of Thyra::Exceptions::LinearOpApplyFailed to LinearOpBase::apply().
Modified LOCA_Stepper to catch a Thyra::Exceptions::LinearOpApplyFailed and treat that as a solver failure.
If you have an working version of the BinUtils library, you should be able to turn on stack tracking so when that final exception is caught, you can see the stack trace. See: |
This problem may not actually exist. I was able to run the example in question and the entire continuation run completed successfully without throwing any exceptions. I've asked @suzeygao to do a clean build pointing to the same commits I'm looking at to see if we can reproduce the problem. Sorry to have bothered you all. |
Sorry it's taken me so long to get back to this. I was able to reproduce the problem Suzey's seeing, and was able to reproduce it from a restart so we don't have to wait 30 hours per run to debug. I obtained the stracktraces below by running
This spits out a stacktrace any time Stacktraces (click to expand)
I'm afraid this leaves me puzzled. Where is this |
I think I see what's going on here. It looks like there are two other occurrences of |
@etphipp, whenever you're in, could you give me a rundown of how this |
I’ll be back in the office next week. We can talk then.
…-Eric
On Mar 9, 2018, at 3:59 AM, Jason M. Gates <notifications@github.com<mailto:notifications@github.com>> wrote:
@etphipp<https://github.com/etphipp>, whenever you're in, could you give me a rundown of how this Stepper class works?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub<#1608 (comment)>, or mute the thread<https://github.com/notifications/unsubscribe-auth/AJBpUGeMSbx1A4ZSjnigfjvFi5R3b4Kfks5tcX-RgaJpZM4O3t8k>.
|
Apparently I was running this case from a restart incorrectly. I fired off the restart run correctly, with what I'm hoping are the same parameters that would've existed at the end of the failed run, and this has been running fine for days now. I'm afraid I'm going to have to try the original run from scratch in |
I'm afraid I can't debug this case. Running |
So I FINALLY was able to reproduce this problem in a debugger.
It looks like diff --git a/packages/teko/src/Teko_StratimikosFactory.cpp b/packages/teko/src/Teko_StratimikosFactory.cpp
index 182819d..d3761c5 100644
--- a/packages/teko/src/Teko_StratimikosFactory.cpp
+++ b/packages/teko/src/Teko_StratimikosFactory.cpp
@@ -1,3 +1,5 @@
+#include "NOX_Exceptions.H"
+
#include "Teko_StratimikosFactory.hpp"
#include "Teuchos_Time.hpp"
@@ -220,7 +222,17 @@ void StratimikosFactory::initializePrec_Thyra(
if(prec_Op==Teuchos::null)
prec_Op = Teko::buildInverse(*invFactory_,fwdOp);
else
- Teko::rebuildInverse(*invFactory_,fwdOp,prec_Op);
+ {
+ try
+ {
+ Teko::rebuildInverse(*invFactory_,fwdOp,prec_Op);
+ }
+ catch (...)
+ TEUCHOS_TEST_FOR_EXCEPTION(true, NOX::Exceptions::SolverFailure,
+ "StratimikosFactory::initializePrec_Thyra(): I'm afraid " \
+ "something went wrong in Teko::rebuildInverse(). Treating " \
+ "this as a solver failure.")
+ }
}
// construct preconditioner but that introduces a NOX dependency into Teko that isn't otherwise there. Alternatively, we could try the following in NOX: diff --git a/packages/nox/src-thyra/NOX_Thyra_Group.C b/packages/nox/src-thyra/NOX_Thyra_Group.C
index 0fc4542..08225ec 100644
--- a/packages/nox/src-thyra/NOX_Thyra_Group.C
+++ b/packages/nox/src-thyra/NOX_Thyra_Group.C
@@ -68,6 +68,7 @@
#include "NOX_Abstract_MultiVector.H"
#include "NOX_Thyra_MultiVector.H"
#include "NOX_Assert.H"
+#include "NOX_Exceptions.H"
NOX::Thyra::Group::
Group(const NOX::Thyra::Vector& initial_guess,
@@ -890,6 +890,7 @@ void NOX::Thyra::Group::updateLOWS() const
this->scaleResidualAndJacobian();
+ try
{
NOX_FUNC_TIME_MONITOR("NOX Total Preconditioner Construction");
@@ -932,6 +933,10 @@ void NOX::Thyra::Group::updateLOWS() const
}
}
+ catch (...)
+ TEUCHOS_TEST_FOR_EXCEPTION(true, Exceptions::SolverFailure,
+ "NOX::Thyra::Group::updateLOWS(): I'm afraid something went wrong in " \
+ "creating the preconditioner. Treating this as a solver failure.")
this->unscaleResidualAndJacobian(); @etphipp, @eric-c-cyr, is that an acceptable solution? Failing that, at this point I think my time would be better spent writing a Python wrapper around Charon that'll detect failures in the midst of a LOCA run and restart. Probably should've done that months ago. |
The first approach I think you just can't do, because NOX (indirectly) depends on Teko and you can't have circular dependencies. I can live with the second approach, but it seems far from ideal. What if a real failure happens that shouldn't be treated as just a solver failure. I think the best solution would be do have a set of exceptions that are independent of Thyra, NOX, Teko, ... that capture this case. Maybe such a thing could be in Teuchos? |
Seems reasonable. Who can we talk to in @trilinos/teuchos about this? |
@etphipp said:
@jmgate said:
That seems reasonable. The question is which Teuchos subpackage would they go in? What is the set of exception classes being proposed? |
At this point I think we are talking about just one exception that represents a numerical solver failure (e.g., preconditioner applied to a singular matrix), although one could imagine others. |
I think the best option is the break the solver interfaces currently in the
and create a new Teuchos subpackage
This would be killing two birds with one stone (i.e. moving these solver interfaces into a logical subpackage and provide a place for some generic solver exceptions). We would then derive some of the exceptions in Thyra, NOX, and other packages from these exceptions. @mhoemmen, what do you think about this idea? |
Ping. Charon still has an open issue waiting on the resolution of this one. What exactly is the procedure to follow for an application to request something from Trilinos and have it actually happen? |
@bartlettroscoe I didn't see that earlier message of yours. I'm OK with the plan that you proposed, to move the solver interface stuff in TeuchosRemainder into a new subpackage, TeuchosSolverInterfaces. @jmgate wrote:
Best practice currently is to have a Trilinos developer on your team. |
@trilinos/teuchos @trilinos/nox @trilinos/stratimikos @trilinos/thyra @jmgate requested that I add an exception class to Teuchos, so that NOX and Stratimikos / Thyra can share a common exception class for reporting failure in setup of a linear solver. I consulted with @jmgate on 03 Dec 2018 to scope the work. This commit adds the requested class, Trilinos::LinearSolverSetupFailure, as well as a unit test.
@trilinos/teuchos @trilinos/nox @trilinos/stratimikos @trilinos/thyra @jmgate requested that I add an exception class to Teuchos, so that NOX and Stratimikos / Thyra can share a common exception class for reporting failure in setup of a linear solver. I consulted with @jmgate on 03 Dec 2018 to scope the work. This commit adds the requested class, Trilinos::LinearSolverSetupFailure, as well as a unit test. This commit fixes trilinos#3983, and works towards resolving trilinos#1608.
PR #3983 adds the new exception class. |
@trilinos/teuchos @trilinos/nox @trilinos/stratimikos @trilinos/thyra @jmgate requested that I add an exception class to Teuchos, so that NOX and Stratimikos / Thyra can share a common exception class for reporting failure in setup of a linear solver. I consulted with @jmgate on 03 Dec 2018 to scope the work. This commit adds the requested class, Trilinos::LinearSolverSetupFailure, as well as a unit test. This commit fixes trilinos#3983, and works towards resolving trilinos#1608.
Charon has run into an issue where in the midst of a LOCA continuation run a preconditioner winds up throwing an exception. It would be ideal for this instance to be treated as a solve failure such that LOCA could back up, decrease the step size, and keep on going. We may be able to accomplish this by adding some exception handling logic to
NOX::Thyra::Group::updateLOWS()
, and then modify that routine to return something that will eventually indicate a solve failure.@trilinos/nox
The text was updated successfully, but these errors were encountered: