Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug fixes for 32-bit physics & correct the lake scheme in FV3_HRRR_c3 & FV3_HRRR_gf #1880

Merged
merged 29 commits into from
Aug 29, 2023

Conversation

SamuelTrahanNOAA
Copy link
Collaborator

@SamuelTrahanNOAA SamuelTrahanNOAA commented Aug 24, 2023

PR Author Checklist:

  • I have linked PR's from all sub-components involved in section below.
  • I am confirming reviews are completed in ALL sub-component PR's.
  • I have run the full RT suite on either Hera/Cheyenne AND have attached the log to this PR below this line:
    • LOG: committed to "tests/logs/RegressionTests_hera.log" in this branch
  • I have added the list of all failed regression tests to "Anticipated changes" section.
  • I have filled out all sections of the template.

Description

Corrects a few problems found in RRFS parallel development and consolodates FV3_HRRR suite regression tests:

  1. The 28 conus13km tests are replaced by only 16 new ones. All use 32-bit physics. Intel optimized tests use FASTER=ON
  2. A new export_hrrr() function in default_vars.sh replaces most of the content of the tests/tests/hrrr_* files.
  3. The FV3_HRRR_gf and FV3_HRRR_c3 suites use the clm lake model.
  4. Initialize some uninitialized arrays.
  5. Use the correct real kind during one step of write grid component communication (fix from @DusanJovic-NOAA)

Linked Issues and Pull Requests

Fixes these three fv3atm issues:

Subcomponent Pull Requests

Blocking Dependencies

Subcomponents involved:

  • AQM
  • CDEPS
  • CICE
  • CMEPS
  • CMakeModules
  • FV3
  • GOCART
  • HYCOM
  • MOM6
  • NOAHMP
  • WW3
  • stochastic_physics
  • none

Anticipated Changes

Input data

  • No changes are expected to input data.

Regression Tests:

  • Changes are expected to the following tests:

All tests with "conus13km" in their name are replaced by new ones with shorter names. New tests hrrr_gf and hrrr_c3 are added.

Full list of new tests
COMPILE | rrfs_phy32 | intel | -DAPP=ATM -DFASTER=ON -DCCPP_SUITES=FV3_RAP,FV3_HRRR,FV3_HRRR_gf -D32BIT=ON -DCCPP_32BIT=ON | + hera cheyenne | fv3 |
RUN | conus13km_control             |                            | baseline |
RUN | conus13km_2threads            |                            |          | conus13km_control
RUN | conus13km_restart_mismatch    |                            | baseline | conus13km_control
RUN | conus13km_control_qr          |                            | baseline |
RUN | conus13km_restart_qr_mismatch |                            | baseline | conus13km_control

# Expected to fail:                                                                                                                                                                                                                          
# RUN | conus13km_restart             |                            |          | conus13km_control                                                                                                                                            
# RUN | conus13km_decomp              |                            |          | conus13km_control                                                                                                                                            

COMPILE | rrfs_phy32_debug | intel | -DAPP=ATM -DDEBUG=ON -DCCPP_SUITES=FV3_RAP,FV3_HRRR,FV3_HRRR_gf -D32BIT=ON -DCCPP_32BIT=ON | + hera cheyenne | fv3 |
RUN | conus13km_debug              |                            | baseline |
RUN | conus13km_debug_2threads     |                            |          |
RUN | conus13km_radar_tten_debug   |                            | baseline |

# Expected to fail:                                                                                                                                                                                                                          
# RUN | conus13km_debug_decomp       |                            |          |                                                                                                                                                               

COMPILE | rrfs | intel | -DAPP=ATM -DCCPP_SUITES=FV3_RAP,FV3_RAP_sfcdiff,FV3_HRRR,FV3_HRRR_gf,FV3_HRRR_c3,FV3_RRFS_v1beta,FV3_RRFS_v1nssl -D32BIT=ON | | fv3 |
RUN | hrrr_gf                       |                            | baseline |
RUN | hrrr_c3                       |                            | baseline |

COMPILE | rrfs_phy32 | gnu | -DAPP=ATM -DCCPP_SUITES=FV3_RAP,FV3_HRRR,FV3_HRRR_gf -D32BIT=ON -DCCPP_32BIT=ON | + hera cheyenne | fv3 |
RUN | conus13km_control             | + hera cheyenne            | baseline |
RUN | conus13km_2threads            | + hera cheyenne            |          | conus13km_control
RUN | conus13km_restart_mismatch    | + hera cheyenne            | baseline | conus13km_control
RUN | conus13km_control_qr          | + hera cheyenne            | baseline |
RUN | conus13km_restart_qr_mismatch | + hera cheyenne            | baseline | conus13km_control

# Expected to fail:                                                                                                                                                                                                                          
# RUN | conus13km_restart             | + hera cheyenne            |          | conus13km_control                                                                                                                                            
# RUN | conus13km_decomp              | + hera cheyenne            |          | conus13km_control                                                                                                                                            

COMPILE | rrfs_phy32_debug | gnu | -DAPP=ATM -DDEBUG=ON -DCCPP_SUITES=FV3_RAP,FV3_HRRR,FV3_HRRR_gf -D32BIT=ON -DCCPP_32BIT=ON | + hera cheyenne | fv3 |
RUN | conus13km_debug              | + hera cheyenne            | baseline |
RUN | conus13km_debug_2threads     | + hera cheyenne            |          |
RUN | conus13km_radar_tten_debug   | + hera cheyenne            | baseline |

# Expected to fail:                                                                                                                                                                                                                          
# RUN | conus13km_debug_decomp       | + hera cheyenne            |          |                                                                                                                                                               

COMPILE | rrfs | gnu | -DAPP=ATM -DCCPP_SUITES=FV3_RAP,FV3_RAP_sfcdiff,FV3_HRRR,FV3_HRRR_gf,FV3_HRRR_c3,FV3_RRFS_v1beta -D32BIT=ON | + hera cheyenne | fv3 |
RUN | hrrr_gf                     | + hera cheyenne            | baseline |
RUN | hrrr_c3                     | + hera cheyenne            | baseline |

Libraries

  • Not Needed
Code Managers Log
  • This PR is up-to-date with the top of all sub-component repositories except for those sub-components which are the subject of this PR.
  • Move new/updated input data on RDHPCS Hera and propagate input data changes to all supported systems.
    • N/A

Testing Log:

  • RDHPCS
    • Hera
    • Orion
    • Jet
    • Gaea
    • Cheyenne
  • WCOSS2
    • Dogwood/Cactus
    • Acorn
  • CI
    • Completed
  • opnReqTest
    • N/A
    • Log attached to comment

@SamuelTrahanNOAA
Copy link
Collaborator Author

This should be combined with #1467. They fix related issues, and both PRs are needed urgently by the RRFS parallels.

@github-actions
Copy link

@SamuelTrahanNOAA please bring these up to date with respective authoritative repositories

  • ufs-weather-model NOT up to date

@SamuelTrahanNOAA
Copy link
Collaborator Author

please bring these up to date with respective authoritative repositories

You are an especially polite bot, github-actions. Yes, I'm working on this now. My regression tests were about to finish from the last pass at updating this branch. Now, I find out that the UFS has updated again since I started. Time to start over, I suppose.

@SamuelTrahanNOAA
Copy link
Collaborator Author

Combining with #1853 would be great too, since that fixes another 32-bit physics quilting restart bug.

@DeniseWorthen
Copy link
Collaborator

@SamuelTrahanNOAA Dusan is on leave, but you could add his changes manually.

@SamuelTrahanNOAA
Copy link
Collaborator Author

Dusan is on leave, but you could add his changes manually.

Well @jkbk2004 didn't want me to merge #1467 to this, but I haven't heard anything from him about #1853

I'd rather not merge in something, just to remove it soon after learning I shouldn't have merged it.

@SamuelTrahanNOAA
Copy link
Collaborator Author

Actually, I'd like to hear from @junwang-noaa about whether to merge #1853 into this. She had doubts about whether Dusan's workaround was the best way forward.

His workaround was to have the ESMF-based write component mimic what is essentially an FMS I/O bug. Specifically, always using double precision to write out axis variables, even if those variables are originally stored in single precision. This way, the quilt and non-quilt restart files will be identical.

The alternative is to accept that quilt and non-quilt axis variables will have different precision. That won't matter anymore once we're using quilting restarts for everything. However, it will make regression testing more difficult until then.

If Jun is here and agrees to use Dusan's workaround, then I can merge #1853. If she'd prefer to wait for Dusan to come back and discuss matters, then I should not merge 1853.

@junwang-noaa
Copy link
Collaborator

@SamuelTrahanNOAA @DusanJovic-NOAA will come back next Monday, let's discuss this with him before we merge it. Thanks

@SamuelTrahanNOAA
Copy link
Collaborator Author

@jkbk2004 - Based on Jun's response, we're not combining any PRs with this one. I have finished my final testing: no output changes, and all new tests pass on Hera. You can proceed with final testing when you are ready.

@SamuelTrahanNOAA
Copy link
Collaborator Author

@junwang-noaa - Certainly, we shall wait on 1853, but I don't want to wait on this PR or #1467 due to the urgency in the RRFS parallels. @jkbk2004 was discussing starting regression tests on this PR tomorrow. Could you review the FV3 PR sometime tomorrow? I believe you're already familiar with the fix in this PR.

@zach1221
Copy link
Collaborator

zach1221 commented Aug 25, 2023

@SamuelTrahanNOAA can you sync up your branch here and resolve conflicts? #1773 is merged so we want to begin working this PR next.

@zach1221
Copy link
Collaborator

All baselines are created on cheyenne except for conus13km_control_qr_intel, conus13km_control_qr_gnu and hrrr_c3_gnu.
hrrr_c3_gnu
image
conus13km cases
image

I tried lowering the TPN on cheyenne to 18 to resolve the hrrr_c3_gnu failure, but it continues.

@SamuelTrahanNOAA
Copy link
Collaborator Author

All baselines are created on cheyenne except for conus13km_control_qr_intel, conus13km_control_qr_gnu and hrrr_c3_gnu.

I need to see the full backtrace of the error to know where and why it is crashing.

@jkbk2004
Copy link
Collaborator

All baselines are created on cheyenne except for conus13km_control_qr_intel, conus13km_control_qr_gnu and hrrr_c3_gnu.

I need to see the full backtrace of the error to know where and why it is crashing.
144: file: module_write_restart_netcdf.F90 line: 452 NetCDF: HDF error
ncerr = nf90_put_var(ncid, varids(i), values=array_r4_3d, start=start_idx); NC_ERR_STOP(ncerr)

@jkbk2004
Copy link
Collaborator

err.log

@SamuelTrahanNOAA
Copy link
Collaborator Author

At this point, I'd like to disable all three tests.

The c3 scheme is experimental, and may have unknown bugs. Thankfully, we now have a way to reproduce a crash, even if it is on a machine few developers can access.

Quilting restart does not work for 32-bit physics. ESMF accesses uninitialized memory extensively during quilt server initialization. This is easy to detect on any platform using valgrind. The result is gibberish data sent to the NetCDF library. If that gibberish includes a signalling NaN, then the model will crash. Whether this is a bug in ESMF, or the model, I do not know. Neither would surprise me. This branch has some bug fixes for 32-bit physics with quilting restart, but not enough to get it to work.

Now that @DusanJovic-NOAA has returned, I'm hoping he can help me get the quilting restart to work with 32-bit physics. Until then, disabling a test of a known-broken feature makes the most sense.

@SamuelTrahanNOAA
Copy link
Collaborator Author

I've disabled these tests:

  1. hrrr_c3_gnu (I retained _intel)
  2. conus13km_control_qr (both gnu and intel)
  3. conus13km_restart_qr_mismatch (both gnu and intel)

@zach1221
Copy link
Collaborator

Ok, I'll run the final matching then on Cheyenne with those cases turned off and then we can merge this PR.

@jkbk2004
Copy link
Collaborator

Created an issue to follow up later: #1882

@BrianCurtis-NOAA
Copy link
Collaborator

wcoss2: tests are hitting wall clock both with baseline creation and comparison. No consistency in which test fails, so it's most likely from high load slowing things down. I'm about to re-run the failed tests (2) and logs will come soon.

@zach1221
Copy link
Collaborator

Update on Cheyenne. The tests just finished a few minutes ago, however one case failed to match against the baseline. Hoping to have it resolved shortly.

@zach1221
Copy link
Collaborator

Testing is complete. I'll follow up on the FV3atm sub-pr

@BrianCurtis-NOAA
Copy link
Collaborator

@SamuelTrahanNOAA FV3 hash: NOAA-EMC/fv3atm@51e570c

@SamuelTrahanNOAA
Copy link
Collaborator Author

My branch points to the head of FV3 develop, and I've reverted .gitmodules.

@zach1221 zach1221 merged commit 03fad6f into ufs-community:develop Aug 29, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Baseline Updates Current baselines will be updated. jenkins-ci Jenkins CI: ORT build/test on docker container Priority: High Ready for Commit Queue The PR is ready for the Commit Queue. All checkboxes in PR template have been checked.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants