Fix failing nightly tests in dashboards! #398
Comments
A few more things:
Thanks for creating this list Irina. I have a few suggestions for preventing this sort of thing from happening again:
It'd be nice to create a single wiki page for all of this. |
I suspect that the AlbanyIntel test is testing options that other tests do not. I see a variety of issues: this test enables FPE checking, so some tests are failing due to that. It also tests the RPI SCOREC adaptive meshing capabilities, and some of these are failing with an error like: Throw test that evaluated to true: (lclNumRows != A.getLocalLength ()). The Crystal Plasticity tests are diffing. I'm not sure what to suggest here. I should probably turn my tests over to others if they still have value, as my available time to stay on top of these has vanished. I like @jewatkins's idea to combine similar tests into single tests with a superset of the features that are of value to the overall project. I'll volunteer to remove mine if one of the other tests would like to absorb any features of value in them. Do we want to check for FPEs in our test suite? I personally would like to have the option to debug using FPEs, but some tests have so many that this ability is no longer there. A couple of us cleaned up all the FPEs in the code a while back.
I think we can give it a try to turn on FPEs in our nightlies. If we get too many red herrings, then we turn them off, otherwise we keep them. Looking at the FPEs enabled in main, I think we should only catch meaningful ones. |
Thanks for the comments thus far. A few replies / follow on comments: 1.) I think it's probably a good idea for someone to take over @gahansen 's tests and monitor / police them regularly. I am somewhat wary about volunteering for this as I already own / monitor 10 nightly tests of Albany... when things get broken, trying to figure out what broke and trying to fix it or finding who to contact to fix it does get time consuming. 2.) I am wary about sweeping the FPE failures under the rug for two reasons:
3.) I could temporarily turn on FPE checking in one of my nightly tests to see if the behavior obtained is similar to Glen's tests. 4.) I see some value in keeping a debug build on the dashboard. I believe @lxmota used to have some debug builds but it appears they are gone. Alejandro, are those gone because they were on some of the machines you used to have that are no longer up (procyon, antares, etc.)? 5.) When I spoke with Jerry this morning, he mentioned that for every failure there are 3 people who could try to fix it: the owner of the dashboard, the owner of the package, and/or the author of the commit that broke the tests. I would say the author of the commit is in general the natural person to fix things, and is likely to be identified by the owner of the dashboard. This is fairly straightforward if bugs are caught shortly after being pushed. In the case where we have so many failures that have been happening for months, and the failures could be due to a number of large Albany refactors and/or Trilinos changes, I am not sure how best to proceed with fixing the failures... I'm afraid the package owners may not have the bandwidth to do this in a timely fashion given the nature / number of the failures.
One other unique feature of the CEE tests that I own is the use of MKL for BLAS/LAPACK. That could account for some of the diffs seen there. I'd be glad to walk through these tests for anyone considering harvesting them (or just taking them over). The last dashboard "greenification" might be instructive for us; see the closed issue (https://github.com/gahansen/Albany/issues/135). Here is one where we tracked down a few of the FPEs at the time (https://github.com/gahansen/Albany/issues/260). I have had FPE checking on in these tests for much of my tenure on the Albany project; it was active last summer. Many of these new FPEs have snuck in since the summer. The Albany64BitDbg test, however, is a new test added Feb 1.
I stopped running the debug builds because they were never clean and flagged bugs that were in either Trilinos or non-LCM-Albany that remained there for years.
Yes, I used to run them from other machines, but they could be easily turned on again on Algol && Proxima.
Antares ran nightly tests on Ubuntu, but that build is no longer used, so I stopped running those as well.
Alejandro
@ikalash I fixed some errors that were due to the merge of #356. The fixes will be merged with #396. I would suggest merging that ASAP to see how much gets fixed. Note: that PR still has 1 test failing, but I think it is due to something else (perhaps changes in Trilinos?). The error looks like
Which test is this? It looks like it's a test using the trapezoidal rule time integrator for 2nd order in time ODEs written by @gahansen originally in Piro. The thing to do now is to probably switch to the Tempus integrators instead of going through this code path, but I think some tests still use it (SCOREC ones perhaps). I find it unlikely that anyone has touched that Piro code recently. |
It's |
Hmmm. I would say (and @amota will agree) that we don't care about the 2nd order integrators in Piro and should just switch all the tests that do not already use Tempus over to it. The RPI folks might have some need for the Piro integrators, but I would say in core LCM applications we do not. I'd still be interested to understand why this is failing all of a sudden. I think we should merge your stuff in, @bartgol, and see how much it fixes. I talked with @amota today and thought about it, and I think it is worthwhile to understand why the FPE and other failures started, and to try to fix them. I guess I am volunteering to look at this since I do not think anyone else will... I can start by adding an FPE check on build to my own nightlies. I am hoping some of the issues can be thrown to Trilinos folks based on the errors within Trilinos, and that gdb can point to the locations of the FPEs... but maybe it is harder than this. @bartgol, I would propose that you push all your remaining refactor stuff to master when it's ready; then I can look at debugging the tests. Maybe it is best to wait until the meeting next week so we can prioritize tests / builds.
I created a build on my Fedora 28 workstation which uses gcc-8.2.1 with FPE check = ON in Albany. It is interesting that only 3 tests fail: Elasticity3DPressureBC (http://cdash.sandia.gov/CDash-2-3-0/viewTest.php?onlyfailed&buildid=78898). So it seems the numerous failures in @gahansen 's build are somewhat dependent on the compiler (Clang, Intel)? |
That's bad, since it forces us to debug with specific compilers. And by the way, I care about clang and intel builds more: the former because it's the most standard-compliant compiler, and the latter because it's the one you would use on a cpu performance run. |
Right. I agree that clang and intel should not be overlooked. I think a lot of folks care about Intel in particular. @gahansen are you able to call in to the meeting next Monday? I think it would be good for you to be there to discuss / make a decision about how to proceed with the testing (what testing to do, who to pass your tests off to, what to try to fix). |
I have created an all-debug build on my machine with FPEs enabled. It will start nightly tomorrow. For one of the tests that failed with FPEs, I ran gdb and here is the error:
I think the problem is alpha can be 0, so there is a division by 0 in Intrepid2. This may explain why in Glen's debug build, an exception was being thrown in Intrepid2. @mperego can you please look into this / help to get it fixed? Should I open a Trilinos issue? |
Here is another cause of FPEs, this time in Sacado:
(https://github.com/trilinos/Trilinos/blob/master/packages/sacado/src/Sacado_Fad_Ops.hpp#L655-L666). Looks like there is another possible division by 0. Probably I should open a Trilinos issue for this too. |
Does this have to do with us feeding bad inputs to Intrepid2/Sacado, or is it an internal error in those packages? In the former case, we should check why our inputs are buggy, while in the latter case it's a Trilinos issue. By the way, how can the test be fine in RELEASE mode if we have a division by 0?!?
@bartgol : I agree with you, we should investigate further whether it's Trilinos or our usage of Trilinos. Regarding the Intrepid2 issue in particular: from the dashboard it appears the problem started on 11/9. We need to check if anything was pushed to Albany that day that would have started the issue. Regarding what happens in release mode: the tests DO die in some builds if you look at my spreadsheet, the Intel and Clang ones in particular. I think behavior with FPEs can depend on the compiler. Depending on where the NaN is and what is done with it, the code can actually run to completion with some compilers despite there being an FPE.
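To illustrate the point about release mode, here is a minimal sketch in Python (not Albany or Trilinos code) of how a NaN from an untrapped 0/0 can slip through a test. Python raises ZeroDivisionError on its own, so the hypothetical helper `ieee_divide` emulates what IEEE-754 hardware returns when FPE trapping is off. The two comparison helpers are also made up for illustration; they only show that the outcome depends on how the check is written.

```python
import math


def ieee_divide(a: float, b: float) -> float:
    """Emulate untrapped IEEE-754 division (hypothetical helper).

    Plain Python raises ZeroDivisionError, so the zero-divisor cases
    are handled by hand to mimic what the hardware would return.
    """
    if b != 0.0:
        return a / b
    if a == 0.0 or math.isnan(a):
        return math.nan  # 0/0 is an invalid operation: result is NaN
    # x/0 with x != 0 is a divide-by-zero: result is a signed infinity
    return math.copysign(math.inf, a) * math.copysign(1.0, b)


def diff_exceeds_tol(computed: float, reference: float, tol: float) -> bool:
    """Report a failure only when the difference is provably too large."""
    return abs(computed - reference) > tol


def within_tol(computed: float, reference: float, tol: float) -> bool:
    """Report a pass only when the difference is provably small enough."""
    return abs(computed - reference) <= tol


alpha = 0.0                       # assumed bad input, as in "alpha can be 0"
value = ieee_divide(0.0, alpha)   # NaN, no trap, execution continues

# Every ordered comparison with NaN is False, so the two checks disagree:
print(value)                                # nan
print(diff_exceeds_tol(value, 1.0, 1e-8))   # False: no failure reported
print(within_tol(value, 1.0, 1e-8))         # False: no pass reported either
```

With the first style of check the NaN goes unnoticed and the run looks clean; with the second it shows up as a diff. With FPE trapping enabled, the division itself would abort the run instead.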
Makes sense.
an error in the Tpetra MV dot routine, towards resolving issue #398.
@ikalash is the Intrepid2 issue still there? The one associated with this error message:
@mperego : I believe my commit last night fixed that error. |
@ikalash thanks! |
This commit should fix a lingering issue from the PHAL::Neumann refactor (commit fe52d48) that was affecting tests using the traction BC. Specifically, with this push, the following tests should be back up on Glen's builds:
Pressure_hex8_trac
Pressure_tetra10_tip
Pressure_tetra10_trac
Pressure_tetra4_tip
Pressure_tetra4_trac
SCOREC_Elasticity_Rename_Tpetra
SCOREC_Elasticity_TracT_SM
SCOREC_Elasticity_Traction_Tpetra
StaticElasticity2D_Traction
StaticElasticity3D_Traction
This push should fix the test SteadyHeat3DTest_D.
Looks like my changes to the radiate BC got clobbered by someone else's commits yesterday, causing the SteadyHeat3DTest_D test to fail. Fixing it again.
Things are finally reasonably clean, so I am closing this issue. There are still a couple of failing tests on some platforms, but separate issues exist to address those.
This issue is a reincarnation of issue #61.
Per the discussion at yesterday's Albany meeting, I have compiled a spreadsheet with a list of all the tests currently failing in the Albany dashboards (attached). There is a fair bit of information here, namely for each test you can see 1.) what nightly it is failing in, and 2.) how it is failing. It is interesting that not all the tests fail everywhere, and not all the tests fail in the same way across all architectures. Here is a list of the failing tests.
ATO:RegHeaviside_3D
ATOT:RegHeaviside_3D
CrystalPlasticity_DislocationDensityHardening
CrystalPlasticity_MinisolverStep_Newton
CrystalPlasticity_MinisolverStep_NewtonLineSearch
CrystalPlasticity_MiniSolverStep_TrustRegion
CrystalPlasticity_MultiFamily
CrystalPlasticity_MultiSlipHard_Implicit
CrystalPlasticity_MultiSlipHard_Implicit_Active_Sets
CrystalPlasticity_OrientationNotOnMesh
CrystalPlasticity_OrientationNotOnMesh_np4
CrystalPlasticity_OrientationOnMesh
CrystalPlasticity_OrientationOnMesh_np4
CrystalPlasticity_QuadSlipDislocationDensityTraction
CrystalPlasticity_SchwarzBar_modified_np1
CrystalPlasticity_SingleElement2d_ElasticShear2d
CrystalPlasticity_SingleElement2d_PlasticShear2d
CrystalPlasticity_SingleElement3d_ElasticShear3d
CrystalPlasticity_SingleElement3d_ElasticShearRotated3d
CrystalPlasticity_SingleSlip_Explicit
CrystalPlasticity_SingleSlip_Implicit
CrystalPlasticity_SingleSlipHard_Explicit
CrystalPlasticity_SingleSlipHard_Implicit
CrystalPlasticity_SingleSlipSaturation
CrystalPlasticity_ThermallyActivatedSlip
Dynamic_ClampedSDBC_NewmarkExplicitAForm_BLMesh_Tempus
Dynamics
Dynamics_SCOREC_Adapt_Tpetra
Dynamics_SCOREC_Tpetra
Elasticity3DPressureBC
Enthalpy
FO_GIS_GisCoupledThicknessTpetra
FO_GIS_GisSensSMBwrtBetaTpetra
Heat3DPUMI_Tpetra_RegressFail
HeliumODEs_HeBubbles
HeliumODEs_HeBubblesDecay
HydrogenKfieldBC
LinComprNS_2DUnvteadyInvPressPulse
Mechanics_PlasticityJ2_2D_Traction
Mechanics_PlasticityJ2_3D_Traction
Mechanics_PorePressureParallelFlow_Serial
Mechanics_PorePressureSimple_Serial
Mechanics2D_J2
MechanicsPorePressureLocalized_Serial
MechanicsTensileCT
MechanicsWithHelium_JustMechanics
MechanicsWithHelium_MechanicsAndHelium
MechanicsWithHelium_MechanicsAndHeliumV2
MechanicsWithHelium_MechanicsAndHydrogen
MechanicsWithHelium_MechanicsAndHydrogenV2
MechanicsWithHydrogen_SERIAL
MechanicsWithHydrogenBar_no_stabilization
MechanicsWithHydrogenBar_requires_stabilization
MechanicsWithHydrogenOrthogonal_SERIAL
MechanicsWithHydrogenParallel_SERIAL
MechanicsWithTemperatureLinearThermalExpansion
MechWithHydrogenFastPath_channel_diffusion
NSVortexShedding2D_TransIRK_Tpetra
Parallel_Dynamic_Cubes_Newmark_Piro
Pressure_hex8
Pressure_hex8_tip
Pressure_hex8_trac
Pressure_tetra10
Pressure_tetra10_tip
Pressure_tetra10_trac
Pressure_tetra4
Pressure_tetra4_tip
Pressure_tetra4_trac
RigidBody
Schwarz_Alternating_Dynamics_CubesInelastic
SCOREC_BimetallicStrip_Traction_Tpetra
SCOREC_ElastAdapt_Necking_SERIAL_Necking_SERIAL_Tpetra
SCOREC_ElastAdapt_Necking_Tpetra
SCOREC_ElastAdapt_SPR_Tpetra_postParma
SCOREC_ElastAdapt_SPR_Tpetra_postZoltan
SCOREC_ElastAdaptSPR_Tpetra
SCOREC_Elasticity_Necking_Tpetra
SCOREC_Elasticity_NeckT_SM
SCOREC_Elasticity_Rename_Tpetra
SCOREC_Elasticity_TracT_SM
SCOREC_Elasticity_Traction_Tpetra
SCOREC_J2Adapt_Tpetra
SCOREC_J2Adapt_Verification_Tpetra
SCOREC_J2Tet10_Tpetra
SCOREC_MechWithTemp_Tpetra
SCOREC_MechWithTemp_Unif_Tpetra
SCOREC_Restart_NoRestartT
SCOREC_Restart_RestartFromFileT
SCOREC_Restart_WriteRestartT
SCOREC_ThermoMechanicalCan_mech_tpetra
SCOREC_ThermoMechanicalCan_thermomech_tpetra
SCOREC_ThermoMechanicalCan_timedep_thermomech_tpetra
Serial_Dynamic_Cubes_Newmark_Piro
StaticElasticity2D_Traction
StaticElasticity3D_Traction
SteadyHeat2D
SteadyHeat2DRobin_Tpetra
SteadyHeat2DSS_dudxdudy_Tpetra
SteadyHeatConstrainedOpt2D_Dirichlet_Mixed_ParamsT
StrongDBC
ThermoMechanicalCan_mech
ThermoMechanicalCan_thermomech
TimeDependentSDBC
It is a lot... you can see with this many tests failing it is difficult to keep track of new things that might get broken in the code.
Some notes summarizing the attached spreadsheet:
In addition to the above failures, the Peridigm build fails to build. @mperego is aware of this and a decision to fix or drop Peridigm will be made sometime in January.
How should we proceed? Should people check the spreadsheet and claim some tests to fix? The good news is I think if one issue is fixed, a lot of the tests will be fixed (for instance, I suspect the LCM FPE tests all suffer from the same problem).
failures.xlsx