Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Quartet of bug fixes for: c3 scheme, quilting restart with 32-bit physics, and string length mismatch in dycore (plus PR #1913, #1917, and #1926) #1893

Merged

Conversation

SamuelTrahanNOAA
Copy link
Collaborator

@SamuelTrahanNOAA SamuelTrahanNOAA commented Sep 8, 2023

PR Author Checklist:

  • I have linked PR's from all sub-components involved in section below.
  • I am confirming reviews are completed in ALL sub-component PR's.
  • I have run the full RT suite on either Hera/Cheyenne AND have attached the log to this PR below this line:
    • LOG: Committed to branch. Tested all eight supported platforms. The dev side of wcoss2 was Cactus for the tests.
  • I have added the list of all failed regression tests to "Anticipated changes" section.
  • I have filled out all sections of the template.

Description

There are three bug fixes, and associated changes to regression tests.

  1. The write component no longer calls recover_fields for restart files. (From @DusanJovic-NOAA.) Doing so causes sporadic crashes due to corrupted longitudes when 32-bit physics is used. This only happens for restart files, who don't need recover_fields anyway. Hence, this PR restricts that call to history files.
  2. Correct a string length mismatch in the dycore. This causes an abort when the gnu compiler is run with debug flags turned on. One of the checks enabled is to abort on standard violations of this type.
  3. Changes some subroutine array argument declarations so the c3 scheme can handle null pointers. This is required by the CCPP coding standards anyway, for that very reason. Without this change, the FV3_HRRR_c3 suite crashes with the gnu compiler.
  4. Initialize arrays in dycore. (From @DusanJovic-NOAA.) Two arrays in the dycore were not initialized. That caused crashes in the quilting restart when 32-bit physics is used.
  5. Removes an incorrect dependency of conus13km_2threads on conus13km_control in rt.conf.
  6. Reduces the forecast length of hrrr tests to 12 hours, which is the latest hour that is verified.
  7. Adds Rocoto support on Cheyenne and GAEA.
  8. Merged into this PR is Fix the broken error detection in GFS_phys_time_vary.fv3.F90 and pointer issues in c3 scheme #1917
  9. Also includes Remove nine 3D arrays from CLM Lake and remove an incorrect dependency in rt.conf #1926
  10. And includes GF radar reflectivity update for RRFS realtime runs #1913, which is expected to be merged before this PR.

Linked Issues and Pull Requests

This issue still occurs on Cheyenne, but no other machines. Cheyenne is going to be retired soon, so I'll assume the issue is fixed until we see evidence to the contrary:

Subcomponent issues:

From #1917:

From #1926

Subcomponent Pull Requests

All are in fv3atm, or its subcomponents.

Blocking Dependencies

None.

Subcomponents involved:

  • AQM
  • CDEPS
  • CICE
  • CMEPS
  • CMakeModules
  • FV3
  • GOCART
  • HYCOM
  • MOM6
  • NOAHMP
  • WW3
  • stochastic_physics
  • none

Anticipated Changes

Input data

  • No changes are expected to input data.

Regression Tests:

  • Changes are expected to the following tests:

All FV3 tests.

Libraries

  • Not Needed
Code Managers Log
  • This PR is up-to-date with the top of all sub-component repositories except for those sub-components which are the subject of this PR.
  • Move new/updated input data on RDHPCS Hera and propagate input data changes to all supported systems.
    • N/A

Testing Log:

  • RDHPCS
    • Hera
    • Orion
    • Jet
    • Gaea
    • Cheyenne
  • WCOSS2
    • Dogwood/Cactus
    • Acorn
  • CI
    • Completed
  • opnReqTest
    • N/A
    • Log attached to comment

@SamuelTrahanNOAA SamuelTrahanNOAA changed the title Triad of bug fixes for: c3 scheme, quilting restart with 32-bit physics, and string length mismatch in dycore Quartet of bug fixes for: c3 scheme, quilting restart with 32-bit physics, and string length mismatch in dycore Sep 9, 2023
@BrianCurtis-NOAA
Copy link
Collaborator

Nevermind, I see my error.

@BrianCurtis-NOAA - Are you saying they're working now?

I've examined my test setup on Cactus, and I don't immediately see anything wrong.

with all of the changes made since last go, I had forgotten to grab all the submodule hash updates, it makes sense they all failed and they should all pass now given how the other machines are performing.

@BrianCurtis-NOAA
Copy link
Collaborator

I used rsync to bring Sam's baseline on WCOSS2 to the baseline storage area, and all but 3 tests passed. The three failed with a timeout. So we'll keep Sam's logs for WCOSS2. I am concerned that the atmaq tests are taking longer, but I beleive it to only be a factor due to the machine slowness.

@SamuelTrahanNOAA
Copy link
Collaborator Author

I am concerned that the atmaq tests are taking longer, but I beleive it to only be a factor due to the machine slowness.

Nothing in these changes should slow down I/O rates. Cactus's filesystem has been preposterously slow the past few days. Jobs that normally finish with several minutes to spare are hitting their wallclock limits. Some jobs don't even finish reading input data. Copying input data to the directory can take over twenty minutes. None of my jobs on Acorn hit their wallclock limit.

So, yes, it's the Cactus filesystem.

tests/2threads.conf Outdated Show resolved Hide resolved
@jkbk2004
Copy link
Collaborator

jkbk2004 commented Oct 3, 2023

We can move on to start merging this pr. @SamuelTrahanNOAA we can start asking to merge physics dependency.

@SamuelTrahanNOAA
Copy link
Collaborator Author

This PR is ready for merge. It points to the head of the NOAA-EMC develop branch of FV3. The .gitmodules matches the head of develop.

zach1221
zach1221 previously approved these changes Oct 3, 2023
.gitmodules Outdated Show resolved Hide resolved
@SamuelTrahanNOAA
Copy link
Collaborator Author

@zach1221 You need to re-approve the PR. Github dismissed your review after I updated .gitmodules a few minutes ago.

@jkbk2004
Copy link
Collaborator

jkbk2004 commented Oct 3, 2023

@BrianCurtis-NOAA can you approve? It sounds like you need to approve change request.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment