[develop] Changes for Derecho, a new platform #894

Merged
16 commits merged into ufs-community:develop
Sep 19, 2023

Conversation

natalie-perlin
Collaborator

@natalie-perlin natalie-perlin commented Aug 23, 2023

Modulefile and other configuration files to adapt the SRW App to the Derecho system.

The software stacks used for testing are based on hdf5/1.14.0 and netcdf/4.9.2, similar to those used in #889.

DESCRIPTION OF CHANGES:

Adding the Derecho system at UCAR/NCAR as a Tier-1 machine.

Type of change

  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

TESTS CONDUCTED:

All fundamental tests pass.

  • hera.intel
  • orion.intel
  • cheyenne.intel
  • cheyenne.gnu
  • derecho.intel
  • gaea.intel
  • jet.intel
  • wcoss2.intel
  • NOAA Cloud (indicate which platform)
  • Jenkins
  • fundamental test suite
  • comprehensive tests (specify which if a subset was used)

DEPENDENCIES:

This PR will resolve issue #884.
This PR depends on #889 - MERGED

DOCUMENTATION:

ISSUE:

CHECKLIST

  • My code follows the style guidelines in the Contributor's Guide
  • I have performed a self-review of my own code using the Code Reviewer's Guide
  • I have commented my code, particularly in hard-to-understand areas
  • My changes need updates to the documentation: add as a new platform.
  • My changes do not require updates to the documentation (explain).
  • My changes generate no new warnings
  • New and existing tests pass with my changes
  • Any dependent changes have been merged and published

LABELS (optional):

A Code Manager needs to add the following labels to this PR:

  • Work In Progress
  • bug
  • enhancement
  • documentation
  • release
  • high priority
  • run_ci
  • run_we2e_fundamental_tests
  • run_we2e_comprehensive_tests
  • Needs Cheyenne test
  • Needs Jet test
  • Needs Hera test
  • Needs Orion test
  • help wanted

CONTRIBUTORS (optional):

@mark-a-potts

Fundamental tests are successful.
#894 (comment)

WE2E_summary_20230823001411.txt
WE2E_summary_20230823013603.txt

Collaborator

@RatkoVasic-NOAA RatkoVasic-NOAA left a comment

Looks good to me.

@MichaelLueken MichaelLueken linked an issue Aug 23, 2023 that may be closed by this pull request
@MichaelLueken MichaelLueken changed the title from "Changes for Derecho, a new platform" to "[develop] Changes for Derecho, a new platform" Aug 23, 2023
@MichaelLueken MichaelLueken added the "help wanted" label Aug 23, 2023
Collaborator

@MichaelLueken MichaelLueken left a comment

@natalie-perlin - Thanks for opening this PR to allow the SRW App to build and run on Derecho!

Since Cheyenne will be decommissioned at the end of the year and given that the NRAL0032 account is out of resources on Cheyenne, should we keep Cheyenne in the various files still, or would it be best to fully transition to Derecho?

If we fully cut support for Cheyenne and fully transition to Derecho, then the modification made in ush/get_crontab_contents.py can be changed so that line 61 would read:

if MACHINE == "DERECHO":

which should allow the Python unittests to pass (currently, the Python unittests are failing in test_get_crontab_contents because the crontab_cmd is being set as usr/bin/crontab rather than crontab).
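
To illustrate the machine check discussed above, here is a minimal sketch. It is not the actual code in ush/get_crontab_contents.py: the helper name `select_crontab_cmd` is hypothetical, and the absolute path `/usr/bin/crontab` is an assumption about what Derecho needs.

```python
def select_crontab_cmd(machine: str) -> str:
    """Return the crontab executable for a machine (hypothetical helper)."""
    # Assumption: only Derecho needs the absolute path; every other
    # platform keeps the bare command, which is what the unit test expects.
    if machine == "DERECHO":
        return "/usr/bin/crontab"
    return "crontab"
```

With a check of this shape, a unit test that runs for any machine other than Derecho would see the plain `crontab` command.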

Has an EPIC Platform ticket been created to create a new Derecho pipeline so that we can add Derecho to the .cicd/Jenkinsfile to run the automated tests on the new platform? If not, please let me know and I can open a ticket for this work.

@natalie-perlin
Collaborator Author

#826

Collaborator

@mkavulich mkavulich left a comment

In addition to my other suggested change, we should remove all the nodesize: lines from all the files in parm/wflow/. It turns out this <nodesize> tag in the Rocoto XML actually does nothing without a corresponding <cores> tag, which we do not have. And the newer Rocoto build on Derecho gives a bunch of deprecation warnings for this tag each time you run rocotorun, so we should just get rid of it.
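
The cleanup suggested above could be scripted roughly as follows. This is only a sketch under stated assumptions: the helper names are hypothetical, the workflow files are assumed to carry a `.yaml` extension, and each `nodesize:` entry is assumed to sit on a single line of its own.

```python
import re
from pathlib import Path
from typing import List

# Matches one whole line whose key is `nodesize`, at any indentation,
# including its trailing newline.
NODESIZE_LINE = re.compile(r"^[ \t]*nodesize:.*(?:\r?\n|$)", re.MULTILINE)


def strip_nodesize(text: str) -> str:
    """Return the text with every `nodesize:` line removed."""
    return NODESIZE_LINE.sub("", text)


def strip_nodesize_in_dir(wflow_dir: str) -> List[str]:
    """Rewrite the YAML files in wflow_dir and return the names that changed."""
    changed = []
    for path in sorted(Path(wflow_dir).glob("*.yaml")):
        original = path.read_text()
        cleaned = strip_nodesize(original)
        if cleaned != original:
            path.write_text(cleaned)
            changed.append(path.name)
    return changed
```

Calling `strip_nodesize_in_dir("parm/wflow")` from the repository root would apply it to the directory named in this comment. A line-based edit is used rather than loading and re-dumping the YAML so that comments and formatting in the files are left untouched.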

Negative news aside, I did confirm I was able to run tests successfully on Derecho! So hopefully once these changes are addressed and the latest development merged in this will be good to go.

Thank you, @mkavulich!

Co-authored-by: Michael Kavulich <kavulich@ucar.edu>
@natalie-perlin
Collaborator Author

@mkavulich - addressed your comments on the YAML files in the wflow/ directory.

Collaborator

@mkavulich mkavulich left a comment

Sorry about those late comments, thanks for addressing them!

@natalie-perlin
Collaborator Author

Merged changes from develop and tested without an additional CMake options file for the UFS WM. After fixing a default for EXTRN_MDL_DATA_STORES: aws in ./ush/machine/derecho.yaml, all the fundamental tests have passed.

Before correcting derecho.yaml:

All 7 experiments finished
Calculating core-hour usage and printing final summary
----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta    COMPLETE              18.17
nco_grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_timeoffset_suite_  DEAD                   0.00
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2        COMPLETE              21.46
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v17_p8_plot  COMPLETE              27.50
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_HRRR          COMPLETE              33.77
grid_SUBCONUS_Ind_3km_ics_HRRR_lbcs_RAP_suite_WoFS_v0              COMPLETE              31.37
grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_GFS_v16                COMPLETE              46.74
----------------------------------------------------------------------------------------------------
Total                                                              DEAD                 179.01

Detailed summary written to /glade/derecho/scratch/nperlin/SRW/expt_dirs/WE2E_summary_20230915113527.txt

After correcting derecho.yaml:

All 1 experiments finished
Calculating core-hour usage and printing final summary
----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
nco_grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_timeoffset_suite_  COMPLETE              22.87
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE              22.87

Detailed summary written to /glade/derecho/scratch/nperlin/SRW/expt_dirs/NCO/WE2E_summary_20230915124841.txt

@natalie-perlin
Collaborator Author

Running comprehensive tests now on Derecho.

@natalie-perlin
Collaborator Author

@MichaelLueken - are there any additional tests needed for Derecho? As for CI/CD, we may not have the account yet.

@natalie-perlin natalie-perlin removed the "help wanted" label Sep 15, 2023
@natalie-perlin
Collaborator Author

Comprehensive tests:

All 62 experiments finished
Calculating core-hour usage and printing final summary
----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
2020_CAD                                                           COMPLETE              33.56
community                                                          COMPLETE              39.81
custom_ESGgrid                                                     COMPLETE              14.23
custom_ESGgrid_Central_Asia_3km                                    DEAD                   0.61
custom_ESGgrid_IndianOcean_6km                                     COMPLETE              21.27
custom_ESGgrid_NewZealand_3km                                      DEAD                   0.71
custom_ESGgrid_Peru_12km                                           COMPLETE              21.60
custom_ESGgrid_SF_1p1km                                            COMPLETE             145.00
custom_GFDLgrid__GFDLgrid_USE_NUM_CELLS_IN_FILENAMES_eq_FALSE      COMPLETE              10.07
custom_GFDLgrid                                                    COMPLETE               9.64
deactivate_tasks                                                   COMPLETE               0.74
get_from_AWS_ics_GEFS_lbcs_GEFS_fmt_grib2_2022040400_ensemble_2me  COMPLETE             636.35
get_from_NOMADS_ics_FV3GFS_lbcs_FV3GFS                             COMPLETE              12.40
grid_CONUS_25km_GFDLgrid_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16      COMPLETE              16.40
grid_CONUS_3km_GFDLgrid_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta   DEAD                   1.96
grid_RRFS_AK_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_plot        DEAD                   0.66
grid_RRFS_AK_3km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR                 COMPLETE             172.99
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_RAP              COMPLETE              30.66
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_plot     COMPLETE              34.19
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR             COMPLETE              30.51
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta      COMPLETE              30.70
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2        COMPLETE              11.95
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_plot     COMPLETE              25.66
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v17_p8_plot  COMPLETE              25.64
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR             COMPLETE              36.98
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_RAP              COMPLETE              71.57
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta      COMPLETE              38.72
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_RAP_suite_RAP                 COMPLETE              17.23
grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_v15p2        COMPLETE              10.89
grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_GFS_v16                COMPLETE              42.13
grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_RRFS_v1beta            COMPLETE              34.03
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2         COMPLETE             211.87
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15_thompson  COMPLETE             300.27
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16           COMPLETE             306.42
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR              COMPLETE             320.71
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta       COMPLETE             318.94
grid_RRFS_CONUScompact_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16   COMPLETE              30.74
grid_RRFS_CONUScompact_13km_ics_HRRR_lbcs_RAP_suite_HRRR           COMPLETE              25.93
grid_RRFS_CONUScompact_13km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta    COMPLETE              25.33
grid_RRFS_CONUScompact_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16   COMPLETE              19.45
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_HRRR          COMPLETE              29.55
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta    COMPLETE              16.47
grid_RRFS_CONUScompact_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16    COMPLETE             251.78
grid_RRFS_CONUScompact_3km_ics_HRRR_lbcs_RAP_suite_HRRR            COMPLETE             264.03
grid_RRFS_CONUScompact_3km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta     COMPLETE             263.06
grid_RRFS_NA_13km_ics_FV3GFS_lbcs_FV3GFS_suite_RAP                 COMPLETE              75.60
grid_SUBCONUS_Ind_3km_ics_FV3GFS_lbcs_FV3GFS_suite_WoFS_v0         COMPLETE              30.48
grid_SUBCONUS_Ind_3km_ics_HRRR_lbcs_HRRR_suite_HRRR                COMPLETE              38.02
grid_SUBCONUS_Ind_3km_ics_HRRR_lbcs_RAP_suite_WoFS_v0              COMPLETE              30.28
grid_SUBCONUS_Ind_3km_ics_NAM_lbcs_NAM_suite_GFS_v16               COMPLETE              48.94
grid_SUBCONUS_Ind_3km_ics_RAP_lbcs_RAP_suite_RRFS_v1beta_plot      COMPLETE              15.62
MET_ensemble_verification_only_vx                                  COMPLETE               0.68
MET_verification_only_vx                                           COMPLETE               0.14
nco                                                                COMPLETE              19.69
nco_ensemble                                                       COMPLETE             102.45
nco_grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16      COMPLETE              29.47
nco_grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_timeoffset_suite_  COMPLETE              22.35
nco_grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15_thom  COMPLETE             292.29
nco_grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_HRRR       DEAD                   1.03
pregen_grid_orog_sfc_climo                                         COMPLETE              12.30
specify_EXTRN_MDL_SYSBASEDIR_ICS_LBCS                              COMPLETE              10.88
specify_template_filenames                                         COMPLETE              13.38
----------------------------------------------------------------------------------------------------
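
Since this table has no Total row, a small script can pull out the DEAD experiments and sum the core hours. The following is a sketch only: the function names are hypothetical, and the row layout (name, status, core hours separated by whitespace) is inferred from the summaries pasted in this conversation rather than from the WE2E scripts themselves.

```python
import re
from typing import List, Tuple

# One table row: experiment name, upper-case status word, core hours.
ROW = re.compile(r"^(\S+)\s+([A-Z]+)\s+(\d+\.\d+)$")

Row = Tuple[str, str, float]


def parse_summary(text: str) -> List[Row]:
    """Extract (name, status, core_hours) rows, skipping rules, headers and totals."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match and match.group(1) != "Total":
            name, status, hours = match.groups()
            rows.append((name, status, float(hours)))
    return rows


def dead_experiments(rows: List[Row]) -> List[str]:
    """Names of experiments whose status is DEAD."""
    return [name for name, status, _ in rows if status == "DEAD"]


def total_core_hours(rows: List[Row]) -> float:
    """Sum of the core-hours column."""
    return sum(hours for _, _, hours in rows)


# Four rows copied from the table above.
SAMPLE = """\
custom_ESGgrid                                                     COMPLETE              14.23
custom_ESGgrid_Central_Asia_3km                                    DEAD                   0.61
custom_ESGgrid_IndianOcean_6km                                     COMPLETE              21.27
custom_ESGgrid_NewZealand_3km                                      DEAD                   0.71
"""

rows = parse_summary(SAMPLE)
dead = dead_experiments(rows)
```

Run against the full summary file, the same functions would list each of the rows marked DEAD above.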

@MichaelLueken
Collaborator

@natalie-perlin - With the decommissioning of Cheyenne, using the coverage.cheyenne test suite for Derecho is perfectly fine. No additional tests need to be added for Derecho with this PR.

Are there plans to add GNU to Derecho at a later time? If there are plans, then we can bring in the coverage.cheyenne.gnu tests as coverage.derecho.gnu.

I'm wrapping up my testing of the Jenkins build and run scripts to ensure that the SRW will build and run using these on Derecho. Additionally, this will also test the coverage suite for the machine. Once they pass, I will give my approval and test the rest of the systems using Jenkins.

Collaborator

@MichaelLueken MichaelLueken left a comment

@natalie-perlin - The SRW App successfully builds on Derecho using the Jenkins .cicd/scripts/srw_build.sh script. Additionally, the coverage.derecho tests were successfully run using .cicd/scripts/srw_test.sh and all tests successfully passed:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
custom_ESGgrid_IndianOcean_6km                                     COMPLETE              21.29
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_plot     COMPLETE              35.41
grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_GFS_v16                COMPLETE              42.15
grid_RRFS_CONUScompact_13km_ics_HRRR_lbcs_RAP_suite_HRRR           COMPLETE              26.62
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta    COMPLETE              16.56
grid_SUBCONUS_Ind_3km_ics_HRRR_lbcs_HRRR_suite_HRRR                COMPLETE              38.69
nco_grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_timeoffset_suite_  COMPLETE              23.09
pregen_grid_orog_sfc_climo                                         COMPLETE              12.96
specify_template_filenames                                         COMPLETE              14.32
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE             231.09

Approving this PR now and running the Jenkins tests for the rest of the platforms (since there is no Jenkins runner for Derecho at this time).

@MichaelLueken MichaelLueken added the "run_we2e_coverage_tests" label Sep 18, 2023
@MichaelLueken
Collaborator

The Jenkins Hera Intel WE2E coverage tests failed for custom_ESGgrid_Central_Asia_3km:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used
----------------------------------------------------------------------------------------------------
custom_ESGgrid_Central_Asia_3km                                    DEAD                   4.73
get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_grib2_2019061200          COMPLETE               5.59
get_from_HPSS_ics_GDAS_lbcs_GDAS_fmt_netcdf_2022040400_ensemble_2  COMPLETE             751.95
get_from_HPSS_ics_HRRR_lbcs_RAP                                    COMPLETE              13.56
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2        COMPLETE               5.77
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_plot     COMPLETE              12.29
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_RAP_suite_RAP                 COMPLETE               9.43
grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_v15p2        COMPLETE               6.10
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2         COMPLETE             227.80
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16           COMPLETE             299.96
grid_RRFS_CONUScompact_3km_ics_HRRR_lbcs_RAP_suite_HRRR            COMPLETE             348.04
pregen_grid_orog_sfc_climo                                         COMPLETE               6.15
----------------------------------------------------------------------------------------------------
Total                                                              DEAD                1691.37

It failed with a strange NetCDF error:

FATAL from PE 18: NetCDF: Start+count exceeds dimension bound: netcdf_read_data_2d: file:INPUT/gfs_data.nc- variable:ps

A rerun of the test was successful:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
custom_ESGgrid_Central_Asia_3km                                    COMPLETE              23.72
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE              23.72

The Orion and Gaea Jenkins tests have successfully passed. Awaiting completion of Hera GNU and Jet tests now.

@MichaelLueken
Collaborator

Both the Hera GNU and Jet WE2E coverage tests successfully passed on Jenkins. Now moving forward with merging this work.

@MichaelLueken MichaelLueken merged commit fc0403e into ufs-community:develop Sep 19, 2023
4 of 5 checks passed
@natalie-perlin natalie-perlin deleted the develop_derecho branch October 13, 2023 03:59
Labels
run_we2e_coverage_tests Run the coverage set of SRW end-to-end tests
Development

Successfully merging this pull request may close these issues.

Add Derecho to supported platforms, as Tier-1 system
4 participants