Update orion stack path to use a new role-epic account location #1846

Closed

Conversation

natalie-perlin
Collaborator

@natalie-perlin natalie-perlin commented Jul 31, 2023

PR Author Checklist:

  • I have linked PR's from all sub-components involved in section below.
  • I am confirming reviews are completed in ALL sub-component PR's.
  • I have run the full RT suite on either Hera/Cheyenne AND have attached the log to this PR below this line:
    • LOG:
  • I have added the list of all failed regression tests to "Anticipated changes" section.
  • I have filled out all sections of the template.

Description

This PR switches Orion to a software stack that has been built in a new role-epic location. All the software libraries have been built in:
/work/noaa/epic/role-epic/contrib/orion/hpc-stack/intel-2022.1.2/

The regression test control_p8 has been run using the new location and passed; the log is attached to this PR.
RegressionTests_orion.log.txt

UPDATE following #1745 merge:
Stack location needs to be /work/noaa/epic/role-epic/contrib/orion/hpc-stack/intel-2022.1.2_ncdf492 on Orion

Linked Issues and Pull Requests

This PR is linked to the issue #1857

Associated UFSWM Issue to close

Subcomponent Pull Requests

Blocking Dependencies

Subcomponents involved:

  • AQM
  • CDEPS
  • CICE
  • CMEPS
  • CMakeModules
  • FV3
  • GOCART
  • HYCOM
  • MOM6
  • NOAHMP
  • WW3
  • stochastic_physics
  • none

Anticipated Changes

Input data

  • No changes are expected to input data.

Regression Tests:

  • No changes are expected to any regression test.
  • Changes are expected to the following tests:

Libraries

  • Not Needed
  • Needed
    • Create separate issue in JCSDA/spack-stack asking for update to library. Include library name, library version.
    • Add issue link from JCSDA/spack-stack following this item
Code Managers Log
  • This PR is up-to-date with the top of all sub-component repositories except for those sub-components which are the subject of this PR.
  • Move new/updated input data on RDHPCS Hera and propagate input data changes to all supported systems.
    • N/A

Testing Log:

  • RDHPCS
    • Hera
    • Orion
    • Jet
    • Gaea
    • Cheyenne
  • WCOSS2
    • Dogwood/Cactus
    • Acorn
  • CI
    • Completed
  • opnReqTest
    • N/A
    • Log attached to comment
[RegressionTests_orion.log](https://github.com/ufs-community/ufs-weather-model/files/12221632/RegressionTests_orion.log)

@natalie-perlin
Collaborator Author

All regression tests have been run on Orion; the log is attached.
All tests pass except for the six listed below, which reported permission errors when accessing HAFS input files in /work/noaa/nems/emc.nemspara/RT/NEMSfv3gfs/input-data-20221101/FV3_hafs_input_data/.

The errors look like the following:
cp: cannot open ‘/work/noaa/nems/emc.nemspara/RT/NEMSfv3gfs/input-data-20221101/FV3_hafs_input_data/INPUT_hafs_regional_storm_following_1nest_atm/ferret.jnl’ for reading: Permission denied
All the failed tests report a similar error, differing only in the corresponding experiment name: INPUT_hafs_regional_<experiment_name>.

hafs_regional_specified_moving_1nest_atm_intel
hafs_regional_storm_following_1nest_atm_intel
hafs_regional_storm_following_1nest_atm_qr_intel
hafs_regional_storm_following_1nest_atm_ocn_intel
hafs_regional_storm_following_1nest_atm_ocn_debug_intel
hafs_regional_storm_following_1nest_atm_ocn_wav_intel
RegressionTests_orion_ALL.log.txt
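
One way to confirm which input files trigger the permission errors described above is to walk the input-data directory and list everything the current user cannot read. The following is only a minimal sketch: the function name `find_unreadable` is hypothetical and is not part of the UFS regression test scripts.

```python
import os
import sys


def find_unreadable(root):
    """Return a sorted list of paths under `root` that the current user cannot read.

    Hypothetical helper for illustration; not part of the UFS test scripts.
    """
    unreadable = []

    def on_error(err):
        # os.walk calls this when a directory itself cannot be listed.
        unreadable.append(err.filename)

    for dirpath, _dirnames, filenames in os.walk(root, onerror=on_error):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # os.access checks read permission for the user running the script.
            if not os.access(path, os.R_OK):
                unreadable.append(path)
    return sorted(unreadable)


if __name__ == "__main__":
    # Usage: python find_unreadable.py <directory> [<directory> ...]
    for root in sys.argv[1:]:
        for path in find_unreadable(root):
            print(path)
```

Running it against the FV3_hafs_input_data directory named above, as the account that runs the regression tests, should list any files like ferret.jnl that that account cannot open.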

DeniseWorthen and others added 2 commits August 2, 2023 16:13
* reverting the bug fix for ktherm=2 allows all cpld tests to pass
and the single datm test using ktherm=2 (datm_cdeps_gfs) to also pass
all other datm tests which use ktherm=1 fail

* update CICE

* change freq in for ice_diag.d global values

* remove unused history* settings

* update DISKNM with epic and rt log

* Update bl.py for new Hera blstore
* add gridtype to gocart CAP.rc

* move to GOCART 20230227 version with threading capability

* udpate to netcdf/4.9.2 and add threading capability for gocart

* update esmf library

* update GOCART to allow no-Nitrates run
Natalie Perlin added 2 commits August 9, 2023 16:31
@natalie-perlin
Collaborator Author

Updated the modulefile for Orion in ./modulefiles/ufs_orion.intel.lua, which included use of a new stack location in /work/noaa/epic/role-epic/contrib/orion/hpc-stack/intel-2022.1.2_ncdf492/modulefiles/stack AND loading miniconda3 from /work/noaa/epic/role-epic/contrib/orion/miniconda3/modulefiles

Testing is underway.

@natalie-perlin
Collaborator Author

natalie-perlin commented Aug 10, 2023

@jkbk2004 -
After updating the modulefile for Orion, all but 14 regression tests PASS. Out of the 14 marked as "failed":

  • 8 tests finished successfully:
cpld_control_qr_p8_intel 005
control_CubedSphereGrid_parallel_intel 028
control_qr_p8_intel 042
hrrr_control_qr_intel 069
rrfs_smoke_conus13km_hrrr_warm_qr_intel 
hafs_regional_1nest_atm_qr_intel 142
hafs_global_1nest_atm_qr_intel 145
hafs_global_multiple_4nests_atm_qr_intel 147
  • 4 other tests failed due to permission denied errors when attempting to read /work/noaa/nems/emc.nemspara/RT/NEMSfv3gfs/input-data-20221101/FV3_hafs_input_data/INPUT_hafs_regional_storm_following_1nest_atm/ferret.jnl
hafs_regional_specified_moving_1nest_atm_intel 148
hafs_regional_storm_following_1nest_atm_qr_intel 150
hafs_regional_storm_following_1nest_atm_ocn_intel 151
hafs_regional_storm_following_1nest_atm_ocn_debug_intel 153

Note that the errors for all four tests name the same unreadable file, regardless of the test name. This could itself indicate a problem, as the directory in the file path does not correspond to the name of each test.

The tests can be found in /work/noaa/epic/nperlin/stmp/nperlin/FV3_RT/rt_204579. Rocotostat output showing the statuses of the tests is attached.
Orion.modulefile.update.tests.txt
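
The observation that every failed test names the same file can be checked by grouping the `cp: cannot open ... Permission denied` messages by path. The following is a rough sketch under stated assumptions: the function name `group_denied_paths` is hypothetical, and it expects the caller to supply each test's log text, since the layout of the per-test logs is not described in this thread.

```python
import re
from collections import defaultdict

# Matches the GNU cp message quoted in this thread; accepts curly or straight quotes.
CP_DENIED = re.compile(
    r"cp: cannot open [‘'](?P<path>.+?)[’'] for reading: Permission denied"
)


def group_denied_paths(logs):
    """Map each denied path to the sorted test names whose log mentions it.

    `logs` is a dict of {test_name: log_text}. Hypothetical helper for
    illustration; how the log text is collected is left to the caller.
    """
    by_path = defaultdict(set)
    for test_name, text in logs.items():
        for match in CP_DENIED.finditer(text):
            by_path[match.group("path")].add(test_name)
    return {path: sorted(tests) for path, tests in by_path.items()}
```

If the result contains a single path shared by all of the failing tests, the failures trace back to one file rather than to a separate problem in each test's input directory.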

@DeniseWorthen
Collaborator

@natalie-perlin That file (ferret.jnl) should not be present in any baseline; it must have been accidentally added at some point (by @binli2337). The file should be removed.

@natalie-perlin
Collaborator Author

@DeniseWorthen - thank you for the note! That should be helpful for regression tests on all the systems. Who could assist with the corresponding changes in the test scripts?

@DeniseWorthen
Collaborator

@natalie-perlin Once you remove the file, that should be all you need to do.

@natalie-perlin
Collaborator Author

@DeniseWorthen @jkbk2004 @binli2337
I do not have authorization to change files in the baselines. Who could help with that, so that progress can be made on this PR and issue #1857 can be resolved?

@DeniseWorthen
Collaborator

@natalie-perlin I suspect only @binli2337 can remove the file (because of how Orion is set up), and he is on AL through the end of the month.

@natalie-perlin
Collaborator Author

natalie-perlin commented Aug 24, 2023

@jkbk2004 - This PR could be closed following the merge of #1857, which uses the new software stack.

@jkbk2004
Collaborator

@natalie-perlin We moved to Spack stack on Orion. Can we close this PR?

@jkbk2004 jkbk2004 closed this Aug 31, 2023

4 participants