
bug fix: disable concurrency in GFS_phys_time_vary_init NetCDF calls AND #2040 and #2043 #2041

Merged (37 commits), Dec 19, 2023

Conversation

SamuelTrahanNOAA
Collaborator

@SamuelTrahanNOAA commented Dec 15, 2023

PR Author Checklist:

  • I have linked PRs from all sub-components involved in the section below.
  • I am confirming reviews are completed in ALL sub-component PRs.
  • I have run the full RT suite on either Hera/Hercules AND have attached the log to this PR below this line:
    • LOG: committed to branch
  • I have added the list of all failed regression tests to "Anticipated changes" section.
  • I have filled out all sections of the template.

Description

NOTE: Two additional PRs (#2040 and #2043) are included in this one

Occasionally, the model will fail in FV3/ccpp/physics/physics/GFS_phys_time_vary.fv3.F90 during the first physics timestep. This is caused by heap corruption earlier, during gfs_phys_time_vary_init. After much debugging, I tracked down the problem.

The NetCDF library is not thread-safe:

The C-based libraries are not thread-safe. C-based libraries are those that depend on the C library, which currently include all language interfaces except for the Java interface

We're calling a non-thread-safe library from a threaded region, so multiple NetCDF calls execute concurrently. The fix is to read the NetCDF files one at a time.

Commit Message

General clean-up, and CCPP concurrency bug fixes

  1. Remove nfhout, nfhmax_hf, nfhout_hf and nsout from fv3atm and the regression tests. ( Remove nfhout, nfhmax_hf, nfhout_hf etc configuration parameters NOAA-EMC/fv3atm#731 )
  2. Add comments to smc pert and fix bug in stc pert ( Fix bugs in soil temperature perturbation #2042 )
  3. Disable concurrency in NetCDF calls within CCPP GFS_phys_time_vary_init subroutine to avoid crashes

Linked Issues and Pull Requests

Associated UFSWM Issue to close

These two PRs are combined with this one. They should be closed when this is merged.

Subcomponent Pull Requests

Blocking Dependencies

None

Subcomponents involved:

  • AQM
  • CDEPS
  • CICE
  • CMEPS
  • CMakeModules
  • FV3
  • GOCART
  • HYCOM
  • MOM6
  • NOAHMP
  • WW3
  • stochastic_physics
  • none

Anticipated Changes

Input data

  • No changes are expected to input data.

Regression Tests:

  • No changes are expected to any regression test.

Libraries

  • Not Needed

Code Managers Log

  • This PR is up-to-date with the top of all sub-component repositories except for those sub-components which are the subject of this PR.
  • Move new/updated input data on RDHPCS Hera and propagate input data changes to all supported systems.
    • N/A

Testing Log:

  • RDHPCS
    • Hera
    • Orion
    • Hercules
    • Jet
    • Gaea
    • Derecho
  • WCOSS2
    • Dogwood/Cactus
    • Acorn
  • CI
    • Completed
  • opnReqTest
    • N/A
    • Log attached to comment

@SamuelTrahanNOAA
Collaborator Author

This PR is innocuous and self-contained. It can be combined with just about anything. Ideally, it should be merged with PRs that don't change the results.

@BrianCurtis-NOAA
Collaborator

@SamuelTrahanNOAA I've been talking with the NetCDF folks and they have been working on a threadsafe NetCDF, and we're just finding out if that work has been merged or will be soon. The process to move to whatever version of NetCDF that is will of course take a while. That was FYI. My question is if this code is easily "reversible" once the thread-safe NetCDF is available?

@SamuelTrahanNOAA
Collaborator Author

My question is if this code is easily "reversible" once the thread-safe NetCDF is available?

The thread-safe NetCDF may not be available on all machines. We would need a way to turn on and off threading during that region.

Turning it back on 100% of the time is a simple matter of putting the OpenMP directives back in. I don't think we'll want it on unconditionally.

How to turn it on conditionally... we need to research and design. The NetCDF library would have to be able to report that it is thread safe. (Even if the code is capable of being built thread-safe, it may not necessarily build that way everywhere. It might be a compile-time flag when building the library.) Certainly, we need a flag to force the code to run single-threaded regardless of NetCDF library, in case something goes wrong. And then there's the matter of deciding how to turn off threading in that region.

@SamuelTrahanNOAA
Collaborator Author

There's one other consideration. We're better off using netcdf-parallel to read across all ranks. It'll stress the filesystem less than we're doing now. Ultimately, it is the fastest option. This would be done at the model level, where we read most fix files already.

@SamuelTrahanNOAA
Collaborator Author

I've merged the top of #2040 and #2043 into this PR's branch.

Unfortunately, two hashes mismatch for #2043:

The one in #2043 doesn't exist in the repository. I'm pointing to NOAA-PSL/stochastic_physics#70 instead. I need the correct hash from @yuanxue2870 before I continue testing.

@SamuelTrahanNOAA
Collaborator Author

It turns out this hash is correct:

See that page for the discussion.

@SamuelTrahanNOAA
Collaborator Author

@DusanJovic-NOAA - Could you please confirm I merged your changes into this branch correctly?

@DusanJovic-NOAA
Collaborator

I think you need to update .gitmodules to point to your fv3atm fork/branch.

@SamuelTrahanNOAA changed the title from "bug fix: disable concurrency in GFS_phys_time_vary_init NetCDF calls" to "bug fix: disable concurrency in GFS_phys_time_vary_init NetCDF calls AND #2040 and #2043" on Dec 18, 2023
@zach1221
Collaborator

zach1221 commented Dec 19, 2023

I'm still getting a failure for regional_atmaq on Gaea C5, here:
/lustre/f2/scratch/Zachary.Shrader/FV3_RT/rt_151345/regional_atmaq_intel/ , where the srun just terminates.
Edit: attempting to adjust the TPN in the tests/tests/regional_atmaq file to see if that resolves it

And the regional_netcdf_parallel_intel failure on Hercules from issue 2015 continues, and I haven't been able to get it to pass yet. It seems to be more persistent than in the past.
Edit: there may be a missing baseline for regional_netcdf_parallel_intel on hercules.

@SamuelTrahanNOAA
Collaborator Author

I'm still getting a failure for regional_atmaq on Gaea C5, here: /lustre/f2/scratch/Zachary.Shrader/FV3_RT/rt_151345/regional_atmaq_intel/ , where the srun just terminates. Edit: attempting to adjust the TPN in the tests/tests/regional_atmaq file to see if that resolves

Nothing in this PR should affect that job.

I have a ticket open for a similar UPP failure. With certain inputs, UPP will freeze forever in MPI_Finalize. This only happens on Gaea C5. The admins have passed the ticket on to the vendor. I haven't heard back in several weeks.

And the regional_netcdf_parallel_intel failure on Hercules from issue 2015 continues, and I haven't been able to get it to pass yet. It seems to be more persistent than in the past. Edit: there may be a missing baseline for regional_netcdf_parallel_intel on hercules.

The NetCDF fixes in here are for reading NetCDF files, not writing. I don't expect them to fix #2015.

@SamuelTrahanNOAA
Collaborator Author

I added a commit message to the PR description:

General clean-up, and CCPP concurrency bug fixes
1. Remove nfhout, nfhmax_hf, nfhout_hf and nsout from fv3atm and the regression tests.
2. Add comments to smc pert and fix bug in stc pert
3. Disable concurrency in NetCDF calls within CCPP GFS_phys_time_vary_init subroutine to avoid crashes

@SamuelTrahanNOAA
Collaborator Author

Could @DusanJovic-NOAA and @yuanxue2870 please review the commit message?

General clean-up, and CCPP concurrency bug fixes
1. Remove nfhout, nfhmax_hf, nfhout_hf and nsout from fv3atm and the regression tests.
2. Add comments to smc pert and fix bug in stc pert
3. Disable concurrency in NetCDF calls within CCPP GFS_phys_time_vary_init subroutine to avoid crashes

When this PR is merged, that text will be the description of the changes recorded forever in the repository.

@DeniseWorthen
Collaborator

@SamuelTrahanNOAA You make a good point about "combined" PRs. In the future, the PR selected to carry the individual PRs will be able to just grab and add.

@DusanJovic-NOAA
Collaborator

Could @DusanJovic-NOAA and @yuanxue2870 please review the commit message?

General clean-up, and CCPP concurrency bug fixes
1. Remove nfhout, nfhmax_hf, nfhout_hf and nsout from fv3atm and the regression tests.
2. Add comments to smc pert and fix bug in stc pert
3. Disable concurrency in NetCDF calls within CCPP GFS_phys_time_vary_init subroutine to avoid crashes

When this PR is merged, that text will be the description of the changes recorded forever in the repository.

Looks good.

@yuanxue2870

yuanxue2870 commented Dec 19, 2023 via email

@BrianCurtis-NOAA
Collaborator

General question here about commit messages, since this PR is next. Should we include in all commit messages which subcomponents were changed, say a line like "Components Updated: FV3, stochastic physics" ?

@BrianCurtis-NOAA
Collaborator

@SamuelTrahanNOAA I've made changes to your commit messages as I want to see how the links for the issues travel inside the commit message. I hope that this works.

@SamuelTrahanNOAA
Collaborator Author

This PR is ready for final testing and merge. I have reverted .gitmodules. All subcomponent hashes match the head of their correct branch.

Labels

  • jenkins-ci: Jenkins CI ORT build/test on docker container
  • No Baseline Change
  • Ready for Commit Queue: The PR is ready for the Commit Queue. All checkboxes in PR template have been checked.
8 participants