Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make sure no two evaluators evaluate the same field #399

Closed
bartgol opened this issue Dec 7, 2018 · 28 comments · Fixed by #424
Closed

Make sure no two evaluators evaluate the same field #399

bartgol opened this issue Dec 7, 2018 · 28 comments · Fixed by #424
Assignees
Labels
build Phalanx Stuff related to Phalanx infrastructure
Milestone

Comments

@bartgol
Copy link
Collaborator

bartgol commented Dec 7, 2018

Following trilinos/Trilinos#4005, we should now make sure that each field is evaluated by exactly one evaluator (1). This is currently a problem in at least Aeras and LandIce. It's going to take some work to clean these up, but I do think it's worth it. It was me who opened the issue requesting this feature, since I already spent 4h of debugging (twice!) to find out I had two evaluators evaluating the same field.

In most cases (and I think all the failing tests on the dashboard that fail for this reason), the two evaluators are in fact two copies of the same evaluator. In this case, it would be safe to overlook the repetition, and allow PHX to discard one of the two. However, there are cases (as I said, it happened to me twice), where two different evaluators evaluate the same field. Perhaps you refactored the code, but left a piece of the old one in, and now you have ev1 and ev2 both evaluating the same field. Or even worse: you always had two evaluators evaluating almost the same field: the two fields just differ because of their dimensions, but changing mesh element type (or the vector dimension) will make them clash, and you would not notice it.

In short, I think that, despite being annoying to fix things that basically worked, this is after all a useful improvement to the library. I will take care of fixing all the LandIce tests in the next few days. @ikalash do you think you have time to look at the Aeras ones? If it is the same source of error as in LandIce, there is probably an interpolation evaluator that is registered multiple times.

(1) Note: contributed fields are ok.

@bartgol bartgol self-assigned this Dec 7, 2018
@bartgol bartgol added build Phalanx Stuff related to Phalanx infrastructure labels Dec 7, 2018
@bartgol bartgol added this to the Clean build milestone Dec 7, 2018
@bartgol bartgol changed the title Make sure Make sure no two evaluators evaluate the same field Dec 7, 2018
@jewatkins
Copy link
Collaborator

It looks like this was the issue I was having in #392.

@ikalash
Copy link
Collaborator

ikalash commented Dec 7, 2018

Yes I can look at Aeras. Seeing what you did @bartgol for LandIce should be helpful.

A more general comment: I think we need to fix the nightlies before we proceed with the Albany refactor. I hope to make some progress on this next week. I am sorry to keep holding things up with the refactor... but the testing is important and the more code we push w/o fixing lingering issues, the more difficult it will be to fix things later.

@bartgol
Copy link
Collaborator Author

bartgol commented Dec 10, 2018

I'm working at fixing LandIce_StokesFOBase.hpp, which is where the evaluators that cause problems are registered. I hope to put a mini-PR up in the next few days.

@ikalash
Copy link
Collaborator

ikalash commented Dec 10, 2018

Thanks @bartgol . I am looking at this now. Can you please expand on what you think is the problem when the error messages such as the following, where ev1 = ev2 in your notation:

Error: PHX::DagManager::registerEvaluator() - The field "Flow State:double:<Cell,Node,Dim>(150,9,4)" that is evaluated by the evaluator named "Gather Solution<Residual>" is already evaluated by another registered evaluator named "Gather Solution<Residual>".

or

 Error: PHX::DagManager::registerEvaluator() - The field "Coord Vec:double:<Cell,QuadPoint,Dim>(750,8,3)" that is evaluated by the evaluator named "MapToPhysicalFrame<Residual>" is already evaluated by another registered evaluator named "MapToPhysicalFrame<Residual>".

You are saying we must have duplicate evaluators without realizing it?

@ikalash ikalash self-assigned this Dec 10, 2018
@ikalash
Copy link
Collaborator

ikalash commented Dec 10, 2018

This affects also some LCM, ATO and SCOREC tests. Underlying them are issues with mechanics-like problem dependencies.

@ikalash
Copy link
Collaborator

ikalash commented Dec 10, 2018

@bartgol never mind my previous question - I figured it out. The fixes should be straightforward I think for LCM and hopefully Aeras.

ikalash added a commit that referenced this issue Dec 10, 2018
3 LCM tests still fail due to DAG problems:

69 - MechanicsWithHydrogenParallel_SERIAL (Failed)
70 - MechanicsWithHydrogenOrthogonal_SERIAL (Failed)
71 - MechanicsWithTemperatureTransientHeat2D (Failed)
ikalash added a commit that referenced this issue Dec 10, 2018
This should fix all ShallowWater problems except:

Aeras_TC5_noAD_RungeKutta4
ikalash added a commit that referenced this issue Dec 10, 2018
Should fix all failing Aeras::XZHydrostatic tests.
@bartgol
Copy link
Collaborator Author

bartgol commented Dec 10, 2018

If the two evaluators are the same, the issue is more like a warning: in practice, we could ignore it. However, there's a risk that a more real problem could sneak in, in which different evaluators evaluate the same field (which is more likely to be wrong).

I'm taking care of LandIce, so if you take care of Aeras and LCM, we should be good.

ikalash added a commit that referenced this issue Dec 10, 2018
Aeras::Hydrostatic tests should be back up now.
ikalash added a commit that referenced this issue Dec 10, 2018
All Aeras tests should be running correctly from now on.
ikalash added a commit that referenced this issue Dec 10, 2018
ikalash added a commit that referenced this issue Dec 10, 2018
Should fix MechanicsWithTemperatureTransientHeat2D problem.
@ikalash
Copy link
Collaborator

ikalash commented Dec 11, 2018

Ok, I have fixed the Aeras tests and most of the LCM tests, but I am having some difficulty with the ATO tests. The following tests fail due to this issue:

ATO:FixedField
ATO:Homogenize_2D
ATOT:Homogenize_2D
ATO:MultiPhys_Homogenize_2D
ATOT:MultiPhys_Homogenize_2D
ATO:2Matl_Homog
ATOT:2Matl_Homog

For the first one, the error is as follows:

 Error: PHX::DagManager::registerEvaluator() - The field "Rho0:double:<Cell,Node>(47,8)" that is evaluated by the evaluator named "Gather Nodal Parameter" is already evaluated by another registered evaluator named "Gather Nodal Parameter".

(http://cdash.sandia.gov/CDash-2-3-0/testDetails.php?test=4138034&build=79111). I thought the issue was multiple instances of

ev = Teuchos::rcp(new PHAL::GatherScalarNodalParameter<EvalT,PHAL::AlbanyTraits>(*p, dl) );

but I'm only finding one such call in ATO_Utils_Def.hpp.

For the rest of them, it is this:

 Error: PHX::DagManager::registerEvaluator() - The field "Stress xx_ave :double:<Dummy>(0)" that is evaluated by the evaluator named "Save Field Stress to Cell State Stress xx_ave Residual" is already evaluated by another registered evaluator named "Save Field Stress to Cell State Stress xx_ave Residual".

(http://cdash.sandia.gov/CDash-2-3-0/testDetails.php?test=4138040&build=79111) . I thought the issue was multiple instances of SaveCellStateField calls in LinearElasticityProblem.hpp, but I'm not seeing the duplication. Also the string in the name seems to be messed up (Stress xx_ave).

@jrobbin , as the owner of ATO, if you'd like to keep ATO in the main branch of Albany, can you please have a look at and try to fix these issues?

@ikalash
Copy link
Collaborator

ikalash commented Dec 11, 2018

@mperego : could you please look at resolving this issue for the following tests, which I believe you were the original author of:

SteadyHeatConstrainedOpt2D_Conductivity_Dist_Param
SteadyHeatConstrainedOpt2D_Conductivity_Dist_ParamT

The problem is that Thermal Conductivity is evaluated by the call:

ev = evalUtils.constructGatherScalarNodalParameter(stateName,fieldName);

as well as

  ev = Teuchos::rcp(new PHAL::ConvertFieldTypePSTtoST<EvalT,PHAL::AlbanyTraits>(*p));

(the second call is supposed to convert the field from ParamScalarT to ScalarT). I thought a simple fix would be to feed ConvertFieldTypePSTtoST and Input and output field name, giving the ScalarT version of "Thermal Conductivity" something else, but there was a similar error that the field was evaluated by multiple evaluators.

@ikalash
Copy link
Collaborator

ikalash commented Dec 13, 2018

@bartgol that's great, thanks Luca!

@jewatkins
Copy link
Collaborator

@bartgol Any progress on the LandIce tests? I was going to run some GIS cases. I could probably take a look at them but I didn't want to work on something you may already have a fix for.

bartgol added a commit that referenced this issue Dec 13, 2018
This should take care of all the ATO tests.
@bartgol
Copy link
Collaborator Author

bartgol commented Dec 13, 2018

@jewatkins I just fixed the ATO tests, and was about to switch to the LandIce ones. I have a branch for that, but things are a bit more complicated there, due to the recent refactor in the problem classes. I am working at a fix, which I think is a flexible and general solution going forward. But it may take another few days.

In the short term, you can rebuild trilinos setting

-D Phalanx_ALLOW_MULTIPLE_EVALUATORS_FOR_SAME_FIELD:BOOL=OFF

This should not cause your whole trilinos to rebuild. Luckily, you should only get a few phalanx files to rebuild, so it's not a terrible short-term solution.

@jewatkins
Copy link
Collaborator

@bartgol Okay thanks!

bartgol added a commit to bartgol/Albany that referenced this issue Dec 13, 2018
Now problems need to specify some info about the fields for which they need
interpolation utility evaluators. The StokesFOBase class then proceeds to
automatically construct all these evaluators. This allow to build interpolation
utilities only once, needed after recent changes in Phalanx (see issue sandialabs#399(.
bartgol added a commit to bartgol/Albany that referenced this issue Dec 18, 2018
Now problems need to specify some info about the fields for which they need
interpolation utility evaluators. The StokesFOBase class then proceeds to
automatically construct all these evaluators. This allow to build interpolation
utilities only once, needed after recent changes in Phalanx (see issue sandialabs#399(.
@bartgol bartgol mentioned this issue Dec 18, 2018
@ikalash
Copy link
Collaborator

ikalash commented Dec 20, 2018

Important correction to one of the posts above: to allow for multiple evaluators one actually wants to set

-D Phalanx_ALLOW_MULTIPLE_EVALUATORS_FOR_SAME_FIELD:BOOL=ON

@mperego
Copy link
Collaborator

mperego commented Dec 21, 2018 via email

@bartgol
Copy link
Collaborator Author

bartgol commented Jan 11, 2019

This should have been fixed by #412 . Reopen if nighties disagree.

@bartgol bartgol closed this as completed Jan 11, 2019
@ikalash ikalash reopened this Jan 12, 2019
@ikalash
Copy link
Collaborator

ikalash commented Jan 12, 2019

It looks like LandIce tests are failing still with this PR going in on my workstation, e.g.,

http://cdash.sandia.gov/CDash-2-3-0/viewTest.php?onlyfailed&buildid=80266

Also all the CISM-Albany tests are failing as well:

http://cdash.sandia.gov/CDash-2-3-0/viewTest.php?onlyfailed&buildid=80265

It's somewhat odd b/c on other machines things seem to be OK, though there are some spot failures as well, e.g.,

FO_GIS_GisCoupledThickness
FO_GIS_GisCoupledThicknessTpetra
FO_Thermo

in the AlbanyIntel build (http://cdash.sandia.gov/CDash-2-3-0/viewTest.php?onlyfailed&buildid=80276).

@bartgol
Copy link
Collaborator Author

bartgol commented Jan 14, 2019

It is odd. It may be related to the funky X_DEPENDS_ON_Y options. I did not fully tested them on Friday; I tested them in December, and I naively thought the merge of master into the branch would not change the behavior. Maybe I was wrong. I'm checking right now on my machine.

@ikalash
Copy link
Collaborator

ikalash commented Jan 14, 2019

I don't believe I have that option on yet (what is it again?). I can try to turn it on to see what happens. Also, if you have SRN access, I can give you access to my machine where the nightlies run so you can have a look for yourself at what might be going on.

@bartgol
Copy link
Collaborator Author

bartgol commented Jan 14, 2019

This build has "-DENABLE_PARAMETERS_DEPEND_ON_SOLUTION:BOOL=ON". I haven't checked the other builds, I thought I'd start with the first one you linked. I don't think it mattered though (on my workstation it passes).

I used gcc 7.2 though. I need to check what happens with 8.2, like in your build. I doubt that really matters, but who knows.

@bartgol
Copy link
Collaborator Author

bartgol commented Jan 14, 2019

Also, I do have SRN access. If you give me access to the machine, I can have a direct look.

@bartgol
Copy link
Collaborator Author

bartgol commented Jan 14, 2019

Oh DANG! For some reason on my machine I reconfigured trilinos to allow multiple evaluators to evaluate the same field. That's why everything passed. I'm rebuilding trilinos, and testing again. Sorry about that.

@ikalash
Copy link
Collaborator

ikalash commented Jan 14, 2019

That would explain it. No worries, it happens :). I would think you should be able to reproduce the problem w/ gcc 7 but in case it is needed in the future, I have given you an account on my workstation, camobap.ca.sandia.gov. Please go ahead and try to log on. Your password is your kerberos. The nightlies can be found here: /home/ikalash/nightlyCDash.

@bartgol
Copy link
Collaborator Author

bartgol commented Jan 17, 2019

Just an update: it's taking me a bit longer than I thought. However, I think some changes I'm making will be necessary when (if) we start putting multiple physics (velocity, temperature, hydrology, thickness,...) together in the same execution. Right now we have some fixes, that allow two specific sets of physics to run together, but things are a bit shaky, and especially when activating the funky X_depends_on_Y options, small changes can introduce quite a big headache.

I am currently trying to make the StokesFOBase class capable to handle coupled physics scenarios, while also being compliant with the recent change in Phalanx.

@ikalash
Copy link
Collaborator

ikalash commented Jan 17, 2019

Thanks for the update. Do you have an ETA for fixing this? I am wondering in particular if I should set -D Phalanx_ALLOW_MULTIPLE_EVALUATORS_FOR_SAME_FIELD:BOOL=ON in my nightlies until this gets fixed so that new failures do not slip through the cracks.

@bartgol
Copy link
Collaborator Author

bartgol commented Jan 17, 2019

Yeah, you can do that for now. We can set it back to OFF later.

@ikalash
Copy link
Collaborator

ikalash commented Jan 18, 2019

Ok, I set it to "ON" for now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
build Phalanx Stuff related to Phalanx infrastructure
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants