
Fix failing nightly tests in dashboards! #398

Closed
ikalash opened this issue Dec 4, 2018 · 23 comments
Labels: duplicate, question, Testing (stuff related to testing Albany, including nightly tests)

Comments

ikalash (Collaborator) commented Dec 4, 2018

This issue is a reincarnation of issue #61.

Per the discussion at yesterday's Albany meeting, I have compiled a spreadsheet with a list of all the tests currently failing in the Albany dashboards (attached). There is a fair bit of information here, namely for each test you can see 1.) what nightly it is failing in, and 2.) how it is failing. It is interesting that not all the tests fail everywhere, and not all the tests fail in the same way across all architectures. Here is a list of the failing tests.

ATO:RegHeaviside_3D
ATOT:RegHeaviside_3D
CrystalPlasticity_DislocationDensityHardening
CrystalPlasticity_MinisolverStep_Newton
CrystalPlasticity_MinisolverStep_NewtonLineSearch
CrystalPlasticity_MiniSolverStep_TrustRegion
CrystalPlasticity_MultiFamily
CrystalPlasticity_MultiSlipHard_Implicit
CrystalPlasticity_MultiSlipHard_Implicit_Active_Sets
CrystalPlasticity_OrientationNotOnMesh
CrystalPlasticity_OrientationNotOnMesh_np4
CrystalPlasticity_OrientationOnMesh
CrystalPlasticity_OrientationOnMesh_np4
CrystalPlasticity_QuadSlipDislocationDensityTraction
CrystalPlasticity_SchwarzBar_modified_np1
CrystalPlasticity_SingleElement2d_ElasticShear2d
CrystalPlasticity_SingleElement2d_PlasticShear2d
CrystalPlasticity_SingleElement3d_ElasticShear3d
CrystalPlasticity_SingleElement3d_ElasticShearRotated3d
CrystalPlasticity_SingleSlip_Explicit
CrystalPlasticity_SingleSlip_Implicit
CrystalPlasticity_SingleSlipHard_Explicit
CrystalPlasticity_SingleSlipHard_Implicit
CrystalPlasticity_SingleSlipSaturation
CrystalPlasticity_ThermallyActivatedSlip
Dynamic_ClampedSDBC_NewmarkExplicitAForm_BLMesh_Tempus
Dynamics
Dynamics_SCOREC_Adapt_Tpetra
Dynamics_SCOREC_Tpetra
Elasticity3DPressureBC
Enthalpy
FO_GIS_GisCoupledThicknessTpetra
FO_GIS_GisSensSMBwrtBetaTpetra
Heat3DPUMI_Tpetra_RegressFail
HeliumODEs_HeBubbles
HeliumODEs_HeBubblesDecay
HydrogenKfieldBC
LinComprNS_2DUnvteadyInvPressPulse
Mechanics_PlasticityJ2_2D_Traction
Mechanics_PlasticityJ2_3D_Traction
Mechanics_PorePressureParallelFlow_Serial
Mechanics_PorePressureSimple_Serial
Mechanics2D_J2
MechanicsPorePressureLocalized_Serial
MechanicsTensileCT
MechanicsWithHelium_JustMechanics
MechanicsWithHelium_MechanicsAndHelium
MechanicsWithHelium_MechanicsAndHeliumV2
MechanicsWithHelium_MechanicsAndHydrogen
MechanicsWithHelium_MechanicsAndHydrogenV2
MechanicsWithHydrogen_SERIAL
MechanicsWithHydrogenBar_no_stabilization
MechanicsWithHydrogenBar_requires_stabilization
MechanicsWithHydrogenOrthogonal_SERIAL
MechanicsWithHydrogenParallel_SERIAL
MechanicsWithTemperatureLinearThermalExpansion
MechWithHydrogenFastPath_channel_diffusion
NSVortexShedding2D_TransIRK_Tpetra
Parallel_Dynamic_Cubes_Newmark_Piro
Pressure_hex8
Pressure_hex8_tip
Pressure_hex8_trac
Pressure_tetra10
Pressure_tetra10_tip
Pressure_tetra10_trac
Pressure_tetra4
Pressure_tetra4_tip
Pressure_tetra4_trac
RigidBody
Schwarz_Alternating_Dynamics_CubesInelastic
SCOREC_BimetallicStrip_Traction_Tpetra
SCOREC_ElastAdapt_Necking_SERIAL_Necking_SERIAL_Tpetra
SCOREC_ElastAdapt_Necking_Tpetra
SCOREC_ElastAdapt_SPR_Tpetra_postParma
SCOREC_ElastAdapt_SPR_Tpetra_postZoltan
SCOREC_ElastAdaptSPR_Tpetra
SCOREC_Elasticity_Necking_Tpetra
SCOREC_Elasticity_NeckT_SM
SCOREC_Elasticity_Rename_Tpetra
SCOREC_Elasticity_TracT_SM
SCOREC_Elasticity_Traction_Tpetra
SCOREC_J2Adapt_Tpetra
SCOREC_J2Adapt_Verification_Tpetra
SCOREC_J2Tet10_Tpetra
SCOREC_MechWithTemp_Tpetra
SCOREC_MechWithTemp_Unif_Tpetra
SCOREC_Restart_NoRestartT
SCOREC_Restart_RestartFromFileT
SCOREC_Restart_WriteRestartT
SCOREC_ThermoMechanicalCan_mech_tpetra
SCOREC_ThermoMechanicalCan_thermomech_tpetra
SCOREC_ThermoMechanicalCan_timedep_thermomech_tpetra
Serial_Dynamic_Cubes_Newmark_Piro
StaticElasticity2D_Traction
StaticElasticity3D_Traction
SteadyHeat2D
SteadyHeat2DRobin_Tpetra
SteadyHeat2DSS_dudxdudy_Tpetra
SteadyHeatConstrainedOpt2D_Dirichlet_Mixed_ParamsT
StrongDBC
ThermoMechanicalCan_mech
ThermoMechanicalCan_thermomech
TimeDependentSDBC

It is a lot... with this many tests failing, it is difficult to keep track of new things that might get broken in the code.

Some notes summarizing the attached spreadsheet:

  • Floating point exceptions (FPEs) appear in a lot of the LCM tests, and seem to show up only in @gahansen's builds on the CEE. I'm guessing this is because those builds have FPE checking on, whereas the other builds do not. We need to figure out what to do about them (a minimal sketch of what FPE trapping does is shown after this list).
  • Some of the tests run to completion but are reported as failures on some platforms. This seems non-deterministic. I think it is not a real problem, but it would be nice to understand why this is happening.
  • The Enthalpy test has a DAG "failure to meet dependencies" error. @mperego , @bartgol : this seems like it should be an easy fix - can you please look at it?
  • Some of the failing LCM tests encounter an Intrepid2 and/or DynRankView bounds error in the Albany64BitDbg build. @mperego perhaps you could look at the Intrepid2 error? You will find the tests by checking the attached spreadsheet.
  • A lot of the failing SCOREC tests have a Tpetra MV error or a DynRankView error in the Albany64BitDbg build.
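
For reference, here is a minimal sketch of what FPE trapping does on Linux/glibc. This is not Albany's actual FPE-check code; it only illustrates why the same division by zero aborts with SIGFPE in a build with FPE checking on, but silently produces inf in a build with it off.

// Minimal FPE-trapping illustration (C++, Linux/glibc). NOT Albany code.
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            // feenableexcept is a glibc extension
#endif
#include <cfenv>
#include <cstdio>

int main() {
  volatile double num = 1.0, den = 0.0;  // volatile: keep the division at run time

  // Trapping off (the default): 1.0/0.0 evaluates to inf and execution
  // continues, which is how an FPE can slip through a build without checking.
  std::printf("untrapped result: %g\n", num / den);

  // Trapping on: the same operation now raises SIGFPE and the process
  // aborts, which is what an FPE-checking nightly build would flag.
  feenableexcept(FE_DIVBYZERO | FE_INVALID | FE_OVERFLOW);
  std::printf("trapped result: %g\n", num / den);  // SIGFPE here

  return 0;
}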

In addition to the above failures, the Peridigm build is failing to compile. @mperego is aware of this, and a decision to fix or drop Peridigm will be made sometime in January.

How should we proceed? Should people check the spreadsheet and claim some tests to fix? The good news is that I think fixing one underlying issue will fix a lot of the tests at once (for instance, I suspect the LCM FPE tests all suffer from the same problem).

failures.xlsx

ikalash added the duplicate, question, and Testing (stuff related to testing Albany, including nightly tests) labels on Dec 4, 2018
ikalash (Collaborator, Author) commented Dec 4, 2018

A few more things:

  • Checking CDash, it appears the builds with the most red (Albany64BitClang, Albany64BitDbg, AlbanyIntel) were completely clean this summer, e.g. in mid-July: http://cdash.sandia.gov/CDash-2-3-0/index.php?project=Albany&date=20180713 . This means things got messed up in the past few months, which is worrisome.
  • @gahansen, do you know what is special about the AlbanyIntel build relative to the other Intel builds on CDash? It is curious that it has so many failures, whereas the other Intel builds are relatively clean.

jewatkins (Collaborator) commented:
Thanks for creating this list Irina. I have a few suggestions for preventing this sort of thing from happening again:

  1. Is it possible to narrow down the nightly tests to a few specific builds? (e.g. gcc, intel, clang, cuda)
  2. Can we make these builds easily accessible to developers (at least to Sandians) through SEMS modules or CEE machines?
  3. Can we create a list of active developers for each package, so that we can ping them when something requires a large commitment?

It'd be nice to create a single wiki page for all of this.

gahansen (Member) commented Dec 4, 2018

I suspect that the AlbanyIntel build is testing options that the other builds do not. I see a variety of issues: this build enables FPE checking, so some tests are failing due to that. It also tests the RPI SCOREC adaptive meshing capabilities; some of those are failing with an error like:

Throw test that evaluated to true: (lclNumRows != A.getLocalLength ())
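
For context, that check fires when two multivectors handed to a Tpetra::MultiVector routine do not have the same local length. A hypothetical, minimal sketch of the kind of call that trips it (not taken from Albany; the maps, sizes, and names below are made up for illustration):

// Hypothetical reproducer of a "lclNumRows != A.getLocalLength ()" throw:
// MultiVector::dot() called on vectors built over maps of different sizes.
#include <Tpetra_Core.hpp>
#include <Tpetra_Map.hpp>
#include <Tpetra_MultiVector.hpp>
#include <Teuchos_Array.hpp>

int main(int argc, char* argv[]) {
  Tpetra::ScopeGuard tpetraScope(&argc, &argv);
  {
    auto comm = Tpetra::getDefaultComm();
    using map_type = Tpetra::Map<>;
    using mv_type  = Tpetra::MultiVector<>;

    auto mapX = Teuchos::rcp(new map_type(100, 0, comm));
    auto mapY = Teuchos::rcp(new map_type(110, 0, comm));  // different size

    mv_type x(mapX, 1);
    mv_type y(mapY, 1);

    Teuchos::Array<mv_type::dot_type> dots(1);
    x.dot(y, dots());  // local lengths differ -> Tpetra's consistency check throws
  }
  return 0;
}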

The Crystal Plasticity tests are diffing.

I'm not sure what to suggest here. I should probably turn my tests over to others if they still have value, as my available time to stay on top of them has vanished. I like @jewatkins' idea to combine similar tests into single tests with a superset of the features that are of value to the overall project. I'll volunteer to remove mine once any features of value in them are absorbed into one of the other tests.

Do we want to check for FPEs in our test suite? I personally would like to have the option to debug using FPEs, but some tests have so many that this ability is no longer there. A couple of us cleaned up all the FPEs in the code a while back (https://github.com/gahansen/Albany/issues/260), but many have crept back in.

bartgol (Collaborator) commented Dec 4, 2018

I think we can try turning FPE checking on in our nightlies. If we get too many red herrings, we turn it off; otherwise we keep it. Looking at the FPEs enabled in main, I think we should only catch meaningful ones.

ikalash (Collaborator, Author) commented Dec 4, 2018

Thanks for the comments thus far. A few replies / follow on comments:

1.) I think it's probably a good idea for someone to take over @gahansen's tests and monitor/police them regularly. I am somewhat wary about volunteering for this, as I already own/monitor 10 nightly tests of Albany... when things break, figuring out what broke and either fixing it or finding who to contact to fix it does get time consuming.

2.) I am wary about sweeping the FPE failures under the rug for two reasons:

  • If you look at my spreadsheet, most of the tests that get FPEs also have other problems in other builds, for instance an exception thrown by a Trilinos package in a debug build. This suggests there might be some underlying issue leading to the FPEs. I would recommend that whoever looks at those failures address the error in the debug build first, then see if that makes the FPEs go away.
  • The dashboard was all green this summer. @gahansen, can you please confirm that FPE checking was on in your builds back then? If it was, then something crept in over the past few months that is causing the FPEs, which is disturbing. I've encountered cases where FPEs were deemed innocuous and we decided to punt on fixing them (we had this in Aeras), but I suspect that is not the case with the currently failing tests... though maybe I am wrong.

3.) I could temporarily turn on FPE checking in one of my nightly tests to see if the behavior obtained is similar to Glen's tests.

4.) I see some value in keeping a debug build on the dashboard. I believe @lxmota used to have some debug builds, but it appears they are gone. Alejandro, are those gone because they were on some of the machines you used to have that are no longer up (procyon, antares, etc.)?

5.) When I spoke with Jerry this morning, he mentioned that for every failure there are three people who could try to fix it: the owner of the dashboard, the owner of the package, and/or the author of the commit that broke the tests. I would say the author of the commit is generally the natural person to fix things, and is likely to be identified by the owner of the dashboard; this is fairly straightforward if bugs are caught shortly after being pushed. But when, as now, failures have been accumulating for months and could be due to a number of large Albany refactors and/or Trilinos changes, I am not sure how best to proceed... I'm afraid the package owners may not have the bandwidth to fix things in a timely fashion, given the nature and number of the failures.

gahansen (Member) commented Dec 4, 2018

One other unique feature of the CEE tests that I own is the use of MKL for BLAS/LAPACK. That could account for some of the diffs seen there. I'd be glad to walk through these tests with anyone considering harvesting them (or just taking them over).

The last dashboard "greenification" might be instructive for us - here is the closed issue https://github.com/gahansen/Albany/issues/135

Here is one where we tracked down a few of the FPEs at the time https://github.com/gahansen/Albany/issues/260

I have had FPE checking on in these tests for much of my tenure on the Albany project, and it was active last summer. Many of these new FPEs have snuck in since the summer. The Albany64BitDbg test, however, is a new test added Feb 1 (https://github.com/gahansen/Albany/issues/258).

lxmota (Contributor) commented Dec 4, 2018 via email

bartgol (Collaborator) commented Dec 4, 2018

@ikalash I fixed some errors that were due to the merge of #356. The fixes will be merged with #396. I would suggest merging that ASAP, to see how much gets fixed.

Note: that PR still has one test failing, but I think it is due to something else (perhaps changes in Trilinos?). The error looks like:


p=0: *** Caught standard std::exception of type 'std::logic_error' :

/home/lbertag/workdir/trilinos/trilinos-install/groppello/gcc/opt/develop/include/Piro_TrapezoidRuleSolver_Def.hpp:470:

Throw number = 1

Throw test that evaluated to true: Teuchos::is_null(dec)

Underlying model in trapezoid decorator does not cast to a Piro::TransientDecorator<Scalar, LocalOrdinal, GlobalOrdinal, Node>
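
For reference, that throw comes from a standard Teuchos pattern: the solver dynamically casts the model it was handed to the decorator type it expects, then throws if the cast comes back null. Below is a hedged sketch of that pattern with placeholder class names; it is not Piro's actual code, only an illustration of what the message means (the model reaching the trapezoid solver was never wrapped in the expected transient decorator).

// Sketch of the cast-and-check pattern behind the throw. Placeholder types.
#include <stdexcept>
#include <Teuchos_RCP.hpp>
#include <Teuchos_TestForException.hpp>

struct ModelBase {                             // stand-in for the model base class
  virtual ~ModelBase() = default;
};
struct TransientDecoratorLike : ModelBase {};  // stand-in for Piro::TransientDecorator

void checkModel(const Teuchos::RCP<ModelBase>& model) {
  // Cast the model to the decorator the solver expects...
  Teuchos::RCP<TransientDecoratorLike> dec =
      Teuchos::rcp_dynamic_cast<TransientDecoratorLike>(model);
  // ...and throw (as in Piro_TrapezoidRuleSolver_Def.hpp) if the cast failed.
  TEUCHOS_TEST_FOR_EXCEPTION(
      Teuchos::is_null(dec), std::logic_error,
      "Underlying model does not cast to the expected transient decorator");
}

int main() {
  // A model that was never wrapped in the decorator trips the check.
  checkModel(Teuchos::rcp(new ModelBase));
  return 0;
}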

ikalash (Collaborator, Author) commented Dec 4, 2018

Which test is this? It looks like a test using the trapezoidal-rule time integrator for second-order-in-time ODEs that @gahansen originally wrote in Piro. The thing to do now is probably to switch to the Tempus integrators instead of going through this code path, but I think some tests still use it (the SCOREC ones, perhaps). I find it unlikely that anyone has touched that Piro code recently.

bartgol (Collaborator) commented Dec 4, 2018

It's Dynamics, from LCM.

ikalash (Collaborator, Author) commented Dec 5, 2018

Hmmm. I would say (and @amota will agree) that we don't care about the second-order integrators in Piro and should just switch all the tests that do not already use Tempus over to it. The RPI folks might have some need for the Piro integrators, but I would say that in core LCM applications we do not. I'd still be interested to understand why this is failing all of a sudden.

I think let's merge your stuff in, @bartgol, and see how much it fixes.

I talked with @amota today and thought about it, and I think it is worthwhile to understand why the FPE and other failures started, and to try to fix them. I guess I am volunteering to look at this, since I do not think anyone else will... I can start by adding an FPE-check-on build to my own nightlies. I am hoping some of the issues can be handed to the Trilinos folks based on the errors within Trilinos, and that gdb can point to the locations of the FPEs... but maybe it is harder than this.

I would propose, @bartgol, that you push all your remaining refactor stuff to master when it's ready; then I can look at debugging the tests. Maybe waiting until the meeting next week is best, so we can prioritize tests/builds.

ikalash (Collaborator, Author) commented Dec 5, 2018

I created a build on my Fedora 28 workstation that uses gcc 8.2.1 with FPE checking ON in Albany. It is interesting that only 3 tests fail:

Elasticity3DPressureBC
Pressure_tetra4
Pressure_tetra10

(http://cdash.sandia.gov/CDash-2-3-0/viewTest.php?onlyfailed&buildid=78898). So it seems the numerous failures in @gahansen 's build are somewhat dependent on the compiler (Clang, Intel)?

bartgol (Collaborator) commented Dec 5, 2018

That's bad, since it forces us to debug with specific compilers. And by the way, I care about the clang and Intel builds more: the former because clang is the most standards-compliant compiler, and the latter because Intel is the one you would use for a CPU performance run.

ikalash (Collaborator, Author) commented Dec 5, 2018

Right. I agree that clang and Intel should not be overlooked. I think a lot of folks care about Intel in particular. @gahansen, are you able to call in to the meeting next Monday? I think it would be good for you to be there to discuss / make a decision about how to proceed with the testing (what testing to do, who to pass your tests off to, what to try to fix).

ikalash (Collaborator, Author) commented Dec 7, 2018

I have created an all-debug build on my machine with FPEs enabled; it will start running nightly tomorrow. For one of the tests that failed with FPEs, I ran gdb and here is the error:

Thread 1 "AlbanyT" received signal SIGFPE, Arithmetic exception.
0x00007ffff6a2d53d in Intrepid2::Kernels::inv_scalar_mult_mat<Kokkos::DynRankView<double, Kokkos::LayoutStride, Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace>, Kokkos::MemoryTraits<0> >, double, Kokkos::DynRankView<double, Kokkos::LayoutStride, Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace>, Kokkos::MemoryTraits<0> > > (B=..., alpha=0, A=...)
    at /home/ikalash/nightlyAlbanyTests/Results/Trilinos/build-dbg/install/include/Intrepid2_Kernels.hpp:514
514	          A(i,j) = B(i,j)/alpha;

I think the problem is that alpha can be 0, so there is a division by 0 in Intrepid2. This may explain why an exception was being thrown in Intrepid2 in Glen's debug build. @mperego can you please look into this / help to get it fixed? Should I open a Trilinos issue?
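
A standalone, simplified version of what that kernel does (A = B / alpha), together with the kind of guard that would avoid the trap, is sketched below. This is only an illustration of the failure mode; it is not the actual Intrepid2 kernel, and the real fix may instead be to stop passing alpha == 0 from the Albany side.

// Simplified stand-in for Intrepid2::Kernels::inv_scalar_mult_mat (A = B/alpha).
#include <cassert>
#include <vector>

void inv_scalar_mult_mat(std::vector<double>& A,
                         const std::vector<double>& B,
                         const double alpha) {
  // With FPE trapping enabled, B[k]/alpha raises SIGFPE when alpha == 0,
  // matching the backtrace above (Intrepid2_Kernels.hpp:514).
  assert(alpha != 0.0 && "inv_scalar_mult_mat: division by zero");
  A.resize(B.size());
  for (std::size_t k = 0; k < B.size(); ++k) {
    A[k] = B[k] / alpha;
  }
}

int main() {
  const std::vector<double> B = {1.0, 2.0, 3.0, 4.0};
  std::vector<double> A;
  inv_scalar_mult_mat(A, B, 2.0);  // fine
  inv_scalar_mult_mat(A, B, 0.0);  // guard fires (or SIGFPE without it, with trapping on)
  return 0;
}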

ikalash (Collaborator, Author) commented Dec 7, 2018

Here is another cause of FPEs, this time in Sacado:

Thread 1 "Albany" received signal SIGFPE, Arithmetic exception.
0x00007fffef89374c in Sacado::Fad::Expr<Sacado::Fad::PowerOp<Sacado::Fad::Expr<Sacado::Fad::SubtractionOp<Sacado::Fad::Expr<Sacado::Fad::GeneralFad<double, Sacado::Fad::DynamicStorage<double, double> >, Sacado::Fad::ExprSpecDefault>, Sacado::Fad::Expr<Sacado::Fad::GeneralFad<double, Sacado::Fad::DynamicStorage<double, double> >, Sacado::Fad::ExprSpecDefault> >, Sacado::Fad::ExprSpecDefault>, Sacado::Fad::ConstExpr<double> >, Sacado::Fad::ExprSpecDefault>::fastAccessDx (this=0x7fffffff2e30, i=0)
    at /home/ikalash/nightlyAlbanyTests/Results/Trilinos/build-dbg/install/include/Sacado_Fad_Ops.hpp:655

(https://github.com/trilinos/Trilinos/blob/master/packages/sacado/src/Sacado_Fad_Ops.hpp#L655-L666). Looks like there is another possible division by 0. Probably I should open a Trilinos issue for this too.
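
For context, here is a standalone illustration (not Sacado's actual code) of why the chain rule for pow can raise an FPE: d/dx pow(f, c) = c * pow(f, c-1) * f', and pow(0, c-1) is a divide-by-zero for c < 1; some AD implementations equivalently compute c * pow(f, c) / f, which divides by zero whenever f == 0. Whether this is exactly what happens at Sacado_Fad_Ops.hpp:655 is an assumption based on the backtrace.

// Toy forward-mode AD showing the pow-derivative FPE. NOT Sacado code.
#include <cmath>
#include <cstdio>

struct Dual {          // value plus one derivative component
  double val, dx;
};

Dual pow_const(const Dual& f, const double c) {
  Dual r;
  r.val = std::pow(f.val, c);
  // FPE-prone term: pow(0, c - 1) is a divide-by-zero (inf) for c < 1.
  r.dx  = c * std::pow(f.val, c - 1.0) * f.dx;
  return r;
}

int main() {
  Dual f{0.0, 1.0};            // f = 0 at this point, f' = 1
  Dual g = pow_const(f, 0.5);  // sqrt(f): derivative is inf at f = 0 (SIGFPE if trapping is on)
  std::printf("g = %g, dg/dx = %g\n", g.val, g.dx);
  return 0;
}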

bartgol (Collaborator) commented Dec 7, 2018

Does this have to do with us feeding bad inputs to Intrepid2/Sacado, or is it an internal error in those packages? In the former case, we should check why our inputs are buggy; in the latter case, it's a Trilinos issue.

By the way, how can the test be fine in RELEASE mode if we have a division by 0?!?

ikalash (Collaborator, Author) commented Dec 7, 2018

@bartgol : I agree with you, we should investigate further whether it's Trilinos or our usage of Trilinos. Regarding the Intrepid2 issue in particular: from the dashboard, it appears the problem started on 11/9. We need to check if anything was pushed to Albany that day that would have started the issue.

Regarding what happens in release mode: the tests DO die in some builds if you look at my spreadsheet, the Intel and Clang ones in particular. I think behavior with FPEs can depend on the compiler. Depending on where the NaN is and what is done with it, the code can actually run to completion with some compilers despite there being an FPE.

bartgol (Collaborator) commented Dec 7, 2018

Makes sense.

ikalash added a commit that referenced this issue Dec 10, 2018
an error in the Tpetra MV dot routine, towards resolving issue #398.
mperego (Collaborator) commented Dec 10, 2018

@ikalash is the Intrepid2 issue still there? The one associated with this error message:

Thread 1 "AlbanyT" received signal SIGFPE, Arithmetic exception.
0x00007ffff6a2d53d in Intrepid2::Kernels::inv_scalar_mult_mat<Kokkos::DynRankView<double, Kokkos::LayoutStride, Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace>, Kokkos::MemoryTraits<0> >, double, Kokkos::DynRankView<double, Kokkos::LayoutStride, Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace>, Kokkos::MemoryTraits<0> > > (B=..., alpha=0, A=...)
    at /home/ikalash/nightlyAlbanyTests/Results/Trilinos/build-dbg/install/include/Intrepid2_Kernels.hpp:514
514	          A(i,j) = B(i,j)/alpha;

ikalash (Collaborator, Author) commented Dec 10, 2018

@mperego : I believe my commit last night fixed that error.

mperego (Collaborator) commented Dec 10, 2018

@ikalash thanks!

ikalash added a commit that referenced this issue Dec 11, 2018
This commit should fix a lingering issue from the PHAL::Neumann
refactor (commit fe52d48) that was
affecting tests using the traction BC.  Specifically,
with this push, the following tests should be back up on Glen's builds:

Pressure_hex8_trac
Pressure_tetra10_tip
Pressure_tetra10_trac
Pressure_tetra4_tip
Pressure_tetra4_trac
SCOREC_Elasticity_Rename_Tpetra
SCOREC_Elasticity_TracT_SM
SCOREC_Elasticity_Traction_Tpetra
StaticElasticity2D_Traction
StaticElasticity3D_Traction
ikalash added a commit that referenced this issue Dec 11, 2018
This push should fix the test SteadyHeat3DTest_D.
ikalash added a commit that referenced this issue Dec 12, 2018
Fixing FPEs that were showing up in the Albany64Bit nightly build

This should fix the following tests:

Dynamics
Dynamics_SCOREC_Tpetra
Dynamics_SCOREC_Adapt_Tpetra
ikalash added a commit that referenced this issue Dec 13, 2018
Looks like my changes to the radiate BC got clobbered by someone else's commits
yesterday, causing the SteadyHeat3DTest_D test to fail.  Fixing it again.
ikalash added a commit that referenced this issue Dec 14, 2018
Fixing FPEs in ATO tests.
ikalash (Collaborator, Author) commented Mar 6, 2019

Things are finally reasonably clean, so I am closing this issue. There are still a couple of failing tests on some platforms, but separate issues exist to address those.
