[develop] Update weather model hash and remove "_vrfy" from bash commands #1074

Merged

Conversation

MichaelLueken
Collaborator

DESCRIPTION OF CHANGES:

The weather model hash has been updated to 4f32a4b (April 15).

Additionally, _vrfy has been removed from the cd, cp, ln, mkdir, mv, and rm bash commands in jobs, scripts, ush, and ush/bash_utils. The modified commands don't function as intended (issue #861) and aren't accepted by NCO (issue #1021).

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)

TESTS CONDUCTED:

  • hera.intel
  • hera.gnu
  • orion.intel
  • hercules.intel
  • derecho.intel
  • gaea.intel
  • jet.intel
  • fundamental test suite
  • comprehensive tests (specify which if a subset was used)

The fundamental tests were run on all tier-1 platforms. Comprehensive tests were run on Hera Intel, Orion, and Hercules. The AQM WE2E test was run on Hera Intel and the sample warm start AQM configuration was run on Hercules.

DEPENDENCIES:

None

DOCUMENTATION:

None

ISSUE:

Fixes #861
Fixes #1021

CHECKLIST

  • My code follows the style guidelines in the Contributor's Guide
  • I have performed a self-review of my own code using the Code Reviewer's Guide
  • My changes do not require updates to the documentation (explain).
  • My changes generate no new warnings
  • New and existing tests pass with my changes

…and remove _vrfy from bash cp, mv, rm, ln, cd, and mkdir commands
…nd removed ush/bash_utils/filesys_cmds_vrfy.sh reference from ush/source_util_funcs.sh
Collaborator

@RatkoVasic-NOAA RatkoVasic-NOAA left a comment

Looks good!

@MichaelLueken MichaelLueken added the run_we2e_coverage_tests Run the coverage set of SRW end-to-end tests label Apr 19, 2024
@chan-hoo
Collaborator

@MichaelLueken, thanks for this change:

Calculating core-hour usage and printing final summary
----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used
----------------------------------------------------------------------------------------------------
aqm_grid_AQM_NA13km_suite_GFS_v16_20240419100352                   COMPLETE            4829.55
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE            4829.55

Approving.

@gsketefian
Collaborator

@MichaelLueken It would probably be a good idea to keep what these wrappers were meant to do: in bash scripts, check whether cp, mv, etc. succeeded. The functions originally printed an error message and stopped execution, so that every time one needs to use, say, cp, one doesn't have to add a line like this:

cp file1 file2 || { echo "Copy failed"; exit 1; }

or something longer, and instead use

cp_vrfy file1 file2

I saw in issue #861 that:

For example, if you want to verify a file was copied with cp_vrfy, but that file does not exist, a message will be printed out, but the exit code will be set to 0, and the program will happily hum along with no failure. 

Obviously, this wasn't the original intent or behavior of these functions, but this erroneous behavior was introduced somewhere along the line due to insufficient testing. So instead of just removing these, what should be done is restore the original functionality of detecting errors and exiting.
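As a rough illustration of the behavior described above (report the failure, then stop), a wrapper could look like the sketch below. The name cp_checked is hypothetical and this is not the SRW App's cp_vrfy implementation; it only shows the check-and-exit pattern. The call runs in a subshell so that the exit does not end the calling shell.

```shell
# Hypothetical helper (not the SRW App's cp_vrfy): run cp and abort the
# calling script with cp's nonzero status if the copy fails.
cp_checked() {
  cp "$@" || {
    local rc=$?
    echo "FATAL: cp $* failed with status ${rc}" >&2
    exit "${rc}"
  }
}

# Scratch directory so the demonstration touches no real files.
workdir=$(mktemp -d)

# Copy a file that does not exist. The subshell exits inside cp_checked,
# so the echo after it is never executed.
wrapper_status=0
wrapper_out=$(
  cp_checked "${workdir}/missing_file" "${workdir}/copy" 2>/dev/null
  echo "not reached"
) || wrapper_status=$?

echo "exit status: ${wrapper_status}, output: '${wrapper_out}'"
rm -rf "${workdir}"
```

The point of the pattern is that the caller sees a nonzero status and no further commands run, which is the behavior issue #861 reports as missing.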

@chan-hoo
Collaborator

@gsketefian, as far as I know (if my memory serves me correctly), bash commands like cp exit with an error message when they fail (while the current _vrfy function does not exit; this was a big issue in my previous runs). For delivery to NCO, though, the -p flag should be added to the copy command (cp -p). So, I think cp file1 file2 is good enough.

@gsketefian
Collaborator

gsketefian commented Apr 19, 2024 via email

@MichaelLueken
Collaborator Author

@gsketefian -

Thanks for raising your concern about the removal of _vrfy from the bash commands. With this PR, I was simply following through with @mkavulich's and @chan-hoo's recommendation of fully removing the _vrfy. As @chan-hoo noted, the standard bash commands should exit automatically on a failure.

Would it be worthwhile to surround the cd, cp, ln, mkdir, mv, and rm commands with set -e and set +e, so that they error out if an issue is encountered? I'm open to more opinions and suggestions, but I don't think that we want to go back to using _vrfy once again.
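For reference, the shell option that stops a script when a command fails is errexit (set -e); set -x only turns on command tracing. The sketch below contrasts the two using a copy of a nonexistent file in a scratch directory. The file names are placeholders, and each case runs in its own bash process so the options do not leak into the calling shell.

```shell
workdir=$(mktemp -d)

# Tracing only (set -x): commands are echoed to stderr, but the failing cp
# does not stop the script, so the echo after it still runs.
trace_status=0
trace_out=$(bash -c '
  set -x
  cp "$1/missing" "$1/copy"
  echo "still running"
' demo "${workdir}" 2>/dev/null) || trace_status=$?

# errexit (set -e): the script stops at the failing cp and returns its status.
errexit_status=0
errexit_out=$(bash -c '
  set -e
  cp "$1/missing" "$1/copy" 2>/dev/null
  echo "still running"
' demo "${workdir}") || errexit_status=$?

echo "set -x: status=${trace_status} output='${trace_out}'"
echo "set -e: status=${errexit_status} output='${errexit_out}'"
rm -rf "${workdir}"
```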

@gsketefian
Collaborator

gsketefian commented Apr 19, 2024 via email

@mkavulich
Collaborator

mkavulich commented Apr 25, 2024

@gsketefian Since it was my issue originally advocating for removal of these scripts I'll provide a defense: I don't see a reason to carry this burden and overhead of custom wrappers around simple shell builtins like this, and as I say in the original issue, they introduce additional problems! Unless I'm missing something, they provide no additional value over just using the -e flag within exscripts (which is something we should think about doing!), and just add further things into the workflow that need maintenance, can cause issues, and (as they are doing now) can mask other issues.

For the record, here is an example of a failure using the shell builtin mv:


mv: cannot stat 'fire.nml': No such file or directory
FATAL ERROR:
ERROR:
  From script:  "JREGIONAL_RUN_FCST"
  Full path to script:  "/glade/derecho/scratch/kavulich/FIRE/2024/april_updates/ufs-srweather-app/jobs/JREGIONAL_RUN_FCST"
Call to ex-script corresponding to J-job "JREGIONAL_RUN_FCST" failed.
Exiting with nonzero status.

And here is with the custom mv_vrfy, after it is "fixed" by adding set -e:



"mv_vrfy" operation returned with a message.  This command was
issued from the script in file:

  ""

Message from "mv_vrfy" function's "mv" operation:
  mv: cannot stat 'fire.nml': No such file or directory

========================================================================
Exiting script:  "JREGIONAL_RUN_FCST"
In directory:    "/glade/derecho/scratch/kavulich/FIRE/2024/april_updates/ufs-srweather-app/jobs"
========================================================================

That is nothing but a whole bunch of unhelpful extra text around the same error message. It seems as if it was meant to echo the script it was called from, but again, it does not do this! And that information would be mostly redundant anyway, since exscripts all print a message when they start.

I did not realize until this PR that set -e is not turned on in all scripts; regardless of the outcome of this discussion we should see if we need to do this as well. I did notice that certain commands (mv as one example) will fail even with the -e flag not set.

Is this a good time to mention that converting everything to python would make this problem moot? 😉

@MichaelLueken
Collaborator Author

All scripts load the ush/preamble.sh script. Inside the ush/preamble.sh script, the default behavior is to use set -euo pipefail. No other changes should be required, as set -euo is already on by default for all scripts.
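As a small illustration of what the -u and -o pipefail options add on top of -e, the sketch below runs three short snippets in separate bash processes. It does not reproduce ush/preamble.sh; the variable name UNDEFINED_DEMO_VAR is a placeholder assumed to be unset in the environment.

```shell
# -u: expanding an unset variable is an error, so the script stops there.
unset_status=0
unset_out=$(bash -c '
  set -euo pipefail
  echo "${UNDEFINED_DEMO_VAR}"
  echo "not reached"
' 2>/dev/null) || unset_status=$?

# Without pipefail, a pipeline reports only the status of its last stage,
# so the failure of "false" is hidden by the successful "cat".
nopipe_status=0
nopipe_out=$(bash -c '
  set -eu
  false | cat
  echo "reached"
') || nopipe_status=$?

# With pipefail, the failing first stage makes the whole pipeline fail,
# and -e then stops the script.
pipe_status=0
pipe_out=$(bash -c '
  set -euo pipefail
  false | cat
  echo "not reached"
') || pipe_status=$?

echo "unset variable:   status=${unset_status} output='${unset_out}'"
echo "without pipefail: status=${nopipe_status} output='${nopipe_out}'"
echo "with pipefail:    status=${pipe_status} output='${pipe_out}'"
```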

Addressing issue related to failures with the grid_SUBCONUS_Ind_3km_ics_FV3GFS_lbcs_FV3GFS_suite_WoFS_v0 metric test, then will push updates and resubmit Jenkins tests.

…h script to allow the test to run successfully while ran as part of the coverage and comprehensive test suite
@MichaelLueken
Collaborator Author

After correcting the grid_SUBCONUS_Ind_3km_ics_FV3GFS_lbcs_FV3GFS_suite_WoFS_v0 test, I have resubmitted the Jenkins tests.

Gaea, Jet, and Orion are currently not available via Jenkins, so I will go ahead and manually run the WE2E coverage tests on those machines.

@MichaelLueken
Collaborator Author

The WE2E tests were manually run on Orion and all tests successfully passed:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
custom_ESGgrid_SF_1p1km_20240429084427                             COMPLETE             437.36
deactivate_tasks_20240429084428                                    COMPLETE               0.93
get_from_AWS_ics_GEFS_lbcs_GEFS_fmt_grib2_2022040400_ensemble_2me  COMPLETE            1878.41
grid_CONUS_3km_GFDLgrid_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta_  COMPLETE            1006.63
grid_RRFS_AK_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_plot_20240  COMPLETE             374.77
grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_RRFS_v1beta_202404290  COMPLETE              21.40
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR_20240429084  COMPLETE             971.98
grid_RRFS_CONUScompact_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_  COMPLETE              64.03
grid_RRFS_CONUScompact_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_2  COMPLETE             728.04
grid_SUBCONUS_Ind_3km_ics_FV3GFS_lbcs_FV3GFS_suite_WoFS_v0_202404  COMPLETE              46.58
2020_CAD_20240429084435                                            COMPLETE              71.58
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE            5601.71

Additionally, the metric test was run and completed successfully:

COL_NAME:                           FCST_MODEL                REF_MODEL N_INIT N_TERM N_VLD SS_INDEX
 GO_INDEX FV3_WoFS_v0_SUBCONUS_3km_test_mem000 FV3_GFS_v16_SUBCONUS_3km      1     17    17  0.99807

@MichaelLueken
Collaborator Author

The Jet WE2E coverage tests were manually run and all passed successfully:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
community_20240429155928                                           COMPLETE              17.51
custom_ESGgrid_20240429155929                                      COMPLETE              25.79
custom_ESGgrid_Great_Lakes_snow_8km_20240429155930                 COMPLETE              20.44
custom_GFDLgrid_20240429155932                                     COMPLETE              10.79
get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_nemsio_2021032018_202404  COMPLETE               8.00
get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_netcdf_2022060112_48h_20  COMPLETE              84.78
get_from_HPSS_ics_RAP_lbcs_RAP_20240429155935                      COMPLETE              15.93
grid_RRFS_AK_3km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR_20240429155936  COMPLETE             617.36
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_plot_20  COMPLETE              64.04
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2_20240  COMPLETE               7.36
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta_2024  COMPLETE             927.52
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE            1799.52

Awaiting results from Gaea and Hera GNU now.

@MichaelLueken
Collaborator Author

All Jenkins tests have successfully passed and the Gaea WE2E coverage tests were manually run and all successfully passed:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
community_20240429144427                                           COMPLETE              46.66
custom_ESGgrid_NewZealand_3km_20240429144429                       COMPLETE             113.58
grid_RRFS_CONUScompact_13km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta_2  COMPLETE              45.98
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_RAP_20240429144  COMPLETE              50.24
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR_2024042914  COMPLETE              49.07
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15_thompson  COMPLETE             589.88
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_HRRR_2024042  COMPLETE              38.81
grid_RRFS_CONUScompact_3km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta_20  COMPLETE             810.44
grid_SUBCONUS_Ind_3km_ics_RAP_lbcs_RAP_suite_RRFS_v1beta_plot_202  COMPLETE              21.65
2020_CAPE_20240429144455                                           COMPLETE              50.25
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE            1816.56

Merging this work now.

@MichaelLueken MichaelLueken merged commit eea4c29 into ufs-community:develop Apr 30, 2024
4 of 5 checks passed
@MichaelLueken MichaelLueken deleted the feature/hash_update branch April 30, 2024 12:54
Labels
run_we2e_coverage_tests Run the coverage set of SRW end-to-end tests
Projects
None yet
5 participants