[develop] Expand forecast fields for metric test #1048

Merged Mar 14, 2024 (13 commits)

Collaborator

@EdwardSnyder-NOAA EdwardSnyder-NOAA commented Mar 4, 2024

DESCRIPTION OF CHANGES:

This PR expands the number of forecast fields for the Skill Score metric test. The forecast length in the metric WE2E test was extended to 12 hours so that the RMSE metric can be calculated for these additional forecast fields:

  • Specific humidity for the full column
  • Temperature for the full column
  • Wind for the full column
  • Dew point, pressure, temperature, and wind at the surface level for forecast hour 12.

Adding these forecast fields makes the skill score metric test more thorough and therefore a more comprehensive baseline to compare against.

Also, a change was made to the .cicd/scripts/srw_metric_example.sh script to reflect the new conda environment.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

TESTS CONDUCTED:

To run the .cicd/scripts/srw_metric_example.sh script, follow the steps below. Note that the script builds the app itself, so it can be run right after manage_externals.

  1. export WORKSPACE=(path of your ufs-srweather-app folder)
  2. export SRW_PLATFORM=(e.g., orion)
  3. export SRW_COMPILER=(e.g., intel)
  4. export SRW_PROJECT=(e.g., epic-ps)
  5. run script: ./.cicd/scripts/srw_metric_example.sh
  • hera.intel
  • orion.intel
  • hercules.intel
  • derecho.intel
  • gaea.intel
  • jet.intel
  • wcoss2.intel
  • NOAA Cloud (indicate which platform) ran on AWS
  • Jenkins
  • fundamental test suite
  • comprehensive tests (specify which if a subset was used)
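
The four exports in the numbered steps above can be guarded with a small pre-flight check so that a missing variable is caught before the build starts. This is only a sketch: the function name check_srw_env is hypothetical and is not part of the repository, and the hint messages simply repeat the examples given in the steps.

```shell
# Hypothetical pre-flight check (not part of ufs-srweather-app).
# ${VAR:?message} makes a non-interactive shell exit with the message when
# VAR is unset or empty. The function body is a subshell, so only the
# subshell exits and the caller just sees a non-zero status.
check_srw_env() (
    : "${WORKSPACE:?set WORKSPACE to the path of your ufs-srweather-app folder}"
    : "${SRW_PLATFORM:?set SRW_PLATFORM, e.g. orion}"
    : "${SRW_COMPILER:?set SRW_COMPILER, e.g. intel}"
    : "${SRW_PROJECT:?set SRW_PROJECT, e.g. epic-ps}"
)
```

With this in place, `check_srw_env && ./.cicd/scripts/srw_metric_example.sh` would launch the script only when all four variables are set.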

DEPENDENCIES:

DOCUMENTATION:

ISSUE:

CHECKLIST

  • My code follows the style guidelines in the Contributor's Guide
  • I have performed a self-review of my own code using the Code Reviewer's Guide
  • I have commented my code, particularly in hard-to-understand areas
  • My changes need updates to the documentation. I have made corresponding changes to the documentation
  • My changes do not require updates to the documentation (explain). The details of the metric test aren't documented.
  • My changes generate no new warnings
  • New and existing tests pass with my changes
  • Any dependent changes have been merged and published

LABELS (optional):

A Code Manager needs to add the following labels to this PR:

  • Work In Progress
  • bug
  • enhancement
  • documentation
  • release
  • high priority
  • run_ci
  • run_we2e_fundamental_tests
  • run_we2e_comprehensive_tests
  • Needs Cheyenne test
  • Needs Jet test
  • Needs Hera test
  • Needs Orion test
  • help wanted

CONTRIBUTORS (optional):

Collaborator

@MichaelLueken MichaelLueken left a comment

@EdwardSnyder-NOAA -

Overall, these changes look good!

However, when I ran the srw_metric_example.sh script, it failed while loading the wflow_{$platform,,} modulefile. Conda needs to be available before the modulefile is loaded; otherwise the script fails. After I moved the #build srw section ahead of the modulefile loading, the script ran successfully.

@RatkoVasic-NOAA
Collaborator

WE2E fundamental tests passed on Hera and Jet:

grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta_2  COMPLETE              10.29
nco_grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_timeoffset_suite_  COMPLETE              14.06
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2_20240  COMPLETE               8.54
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v17_p8_plot  COMPLETE              15.44
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_HRRR_2024030  COMPLETE              24.69
grid_SUBCONUS_Ind_3km_ics_HRRR_lbcs_RAP_suite_WoFS_v0_20240304192  COMPLETE              21.40
grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_GFS_v16_2024030419291  COMPLETE              22.70
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE             117.12

Detailed summary written to /mnt/lfs4/HFIP/hfv3gfs/Ratko.Vasic/1048/expt_dirs/WE2E_summary_20240304221958.txt

grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta_2  COMPLETE               8.87
nco_grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_timeoffset_suite_  COMPLETE              12.09
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2_20240  COMPLETE               7.18
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v17_p8_plot  COMPLETE              13.30
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_HRRR_2024030  COMPLETE              26.23
grid_SUBCONUS_Ind_3km_ics_HRRR_lbcs_RAP_suite_WoFS_v0_20240304192  COMPLETE              13.32
grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_GFS_v16_2024030419284  COMPLETE              19.29
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE             100.28

Detailed summary written to /scratch2/NCEPDEV/fv3-cam/Ratko.Vasic/1048/expt_dirs/WE2E_summary_20240304230434.txt

@RatkoVasic-NOAA
Collaborator

After running these commands:

  export WORKSPACE=/scratch2/NCEPDEV/fv3-cam/Ratko.Vasic/1048/ufs-srweather-app
  export SRW_PLATFORM=hera
  export SRW_COMPILER=intel
  export ACCOUNT=epic
  ./.cicd/scripts/srw_metric_example.sh

Shell failed with:

+ set -e -u
+ cd /scratch2/NCEPDEV/fv3-cam/Ratko.Vasic/1048/ufs-srweather-app/hera/tests
./.cicd/scripts/srw_metric_example.sh: line 53: cd: /scratch2/NCEPDEV/fv3-cam/Ratko.Vasic/1048/ufs-srweather-app/hera/tests: No such file or directory

Script ./.cicd/scripts/srw_metric_example.sh, line 20, looks like this:

declare workspace
if [[ -n "${WORKSPACE}/${SRW_PLATFORM}" ]]; then
    workspace="${WORKSPACE}/${SRW_PLATFORM}"
else
    workspace="$(cd -- "${script_dir}/../.." && pwd)"
fi

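The workspace variable always takes the ${WORKSPACE}/${SRW_PLATFORM} value, because -n is true for any non-empty string and this string always contains at least the "/".
It looks like you should replace -n (true if the string is non-empty) with -d (true if the directory exists).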
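
As a minimal sketch of the suggested change, the same selection can be written with -d. The function name pick_workspace and its three arguments are hypothetical stand-ins for the variables in the quoted block; only the test operator is meant to differ from the original logic.

```shell
# Hypothetical stand-in for the quoted block:
#   $1 = workspace root, $2 = platform name, $3 = fallback directory.
pick_workspace() {
    # -d is true only when the path exists and is a directory, so platforms
    # without a per-platform subdirectory fall through to the fallback.
    # ([[ -d ... ]] in bash behaves the same way for this check.)
    if [ -d "$1/$2" ]; then
        printf '%s\n' "$1/$2"
    else
        printf '%s\n' "$3"
    fi
}
```

Called as `pick_workspace "${WORKSPACE}" "${SRW_PLATFORM}" "${fallback}"`, this prints the per-platform directory only where it actually exists.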

@RatkoVasic-NOAA
Collaborator

Also, the SRW_PROJECT variable should be set.
If it is not, the account is set to "no_account":
<!ENTITY ACCOUNT "no_account">

@EdwardSnyder-NOAA
Collaborator Author

This logic was added to address shared workspaces for Gaea/Gaea-c5 and Hercules/Orion. I checked a number of T1 platforms to see if the SRW_PLATFORM directory exists and found that it only does for Gaea and Hercules/Orion. Given that this variable is a required argument, I'll change the logic to "-d" to avoid errors for non-shared workspace platforms.

@EdwardSnyder-NOAA
Collaborator Author

Another good find @RatkoVasic-NOAA! Somehow the experiment passed for me with no account on PW AWS. To resolve this, simply export SRW_PROJECT=<e.g. epic> instead of exporting ACCOUNT. I updated the directions in the PR.

@RatkoVasic-NOAA
Collaborator

I tested the PR on five machines: three passed (Hera, Jet, and Hercules) and two failed (Orion and Gaea):

Hercules:
+ [[ 0.99043 < 0.700 ]]
+ echo 'Congrats! You pass check!'
Jet:
+ [[ 0.9855 < 0.700 ]]
+ echo 'Congrats! You pass check!'
Hera:
+ [[ 0.99043 < 0.700 ]]
+ echo 'Congrats! You pass check!'
Gaea:
Shell debugging temporarily silenced: export LMOD_SH_DBG_ON=1 for Lmod's output
Lmod has detected the following error:  These module(s) or extension(s) exist but cannot be loaded as requested: "python/3.10.8"
   Try: "module spider python/3.10.8" to see how to load the module(s).
Orion:
+++ . /apps/intel-2022.1.2/intel-2022.1.2/intelpython/latest/etc/conda/deactivate.d/xgboost_deactivate.sh
/apps/intel-2022.1.2/intel-2022.1.2/intelpython/latest/etc/conda/deactivate.d/xgboost_deactivate.sh: line 16: OCL_ICD_FILENAMES_RESET: unbound variable
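
The Orion trace above is typical of running under set -u while a third-party script (here a conda deactivate hook) reads a variable that was never set. One common workaround, which is not necessarily the fix adopted in this PR, is to relax nounset only around the external call. In the sketch below, run_without_nounset is a hypothetical helper.

```shell
# Hypothetical helper: run a command with nounset (set -u) disabled, then
# restore the caller's previous setting and pass the exit status through.
run_without_nounset() {
    local had_u=0 rc=0
    # $- lists the active shell option letters; "u" means nounset is on.
    case "$-" in
        *u*) had_u=1 ;;
    esac
    set +u
    "$@" || rc=$?
    if [ "$had_u" -eq 1 ]; then
        set -u
    fi
    return "$rc"
}
```

For example, `run_without_nounset . /path/to/hook.sh` would source a hook script (the path is a placeholder) without tripping over its unset variables, while the rest of the calling script keeps set -u.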

@RatkoVasic-NOAA
Collaborator

RatkoVasic-NOAA commented Mar 7, 2024

And Orion:
+ [[ 0.99043 < 0.700 ]]
+ echo 'Congrats! You pass check!'
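
A note on the traces above: inside [[ ]], the < operator compares strings lexicographically rather than numerically. That gives the expected answer here because both operands have the form 0.xxx, but it would misorder values such as 10.5 and 9.0. A minimal numeric alternative, assuming awk is available, might look like the following; the function name score_below is hypothetical, and this is not the code used in srw_metric_example.sh.

```shell
# Hypothetical helper: succeed (exit status 0) when $1 is numerically
# smaller than $2. Adding 0 forces awk to treat both values as numbers.
score_below() {
    awk -v score="$1" -v limit="$2" 'BEGIN { exit !(score + 0 < limit + 0) }'
}
```

With a score of 0.99043 and a limit of 0.700, score_below returns non-zero, which is consistent with the passing result shown in the traces.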

@EdwardSnyder-NOAA
Collaborator Author

EdwardSnyder-NOAA commented Mar 7, 2024

Gaea passed for me with the latest changes here: /gpfs/f5/epic/scratch/Edward.Snyder/pr_1048/ufs-srweather-app

+ [[ 0.98789 < 0.700 ]]
+ echo 'Congrats! You pass check!'

@RatkoVasic-NOAA
Collaborator

Gaea worked for me as well:

+ [[ 0.98789 < 0.700 ]]
+ echo 'Congrats! You pass check!'

Collaborator

@MichaelLueken MichaelLueken left a comment

@EdwardSnyder-NOAA -

Thank you very much for working with @RatkoVasic-NOAA and me to address our concerns!

The Derecho test successfully ran:

Skill Score: 0.98937
+ [[ 0.98937 < 0.700 ]]
+ echo 'Congrats! You pass check!'
Congrats! You pass check!

The Gaea test successfully ran:

Skill Score: 0.98789
+ [[ 0.98789 < 0.700 ]]
+ echo 'Congrats! You pass check!'
Congrats! You pass check!

The Orion test successfully passed:

Skill Score: 0.99043
+ [[ 0.99043 < 0.700 ]]
+ echo 'Congrats! You pass check!'
Congrats! You pass check!

Approving this PR now.

@MichaelLueken MichaelLueken added the run_we2e_coverage_tests Run the coverage set of SRW end-to-end tests label Mar 8, 2024
@MichaelLueken
Collaborator

MichaelLueken commented Mar 11, 2024

The Jet get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_netcdf_2022060112_48h WE2E test failed in make_ics and make_lbcs due to out-of-memory (OOM) issues. After the tasks were rerun with rocotorewind/rocotoboot, the test passed.

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
community_20240308170044                                           COMPLETE              17.40
custom_ESGgrid_20240308170046                                      COMPLETE              17.69
custom_ESGgrid_Great_Lakes_snow_8km_20240308170047                 COMPLETE              12.49
custom_GFDLgrid_20240308170049                                     COMPLETE               8.85
get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_nemsio_2021032018_202403  COMPLETE              10.45
get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_netcdf_2022060112_48h_20  COMPLETE              50.99
get_from_HPSS_ics_RAP_lbcs_RAP_20240308170053                      COMPLETE              15.61
grid_RRFS_AK_3km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR_20240308170055  COMPLETE             215.15
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_plot_20  COMPLETE              41.45
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2_20240  COMPLETE               8.24
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta_2024  COMPLETE             494.60
nco_grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_HRRR_2024  COMPLETE              10.73
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE             903.65

The tests have successfully passed on Derecho, Gaea, and Hercules. The tests are still running on Hera.

@MichaelLueken
Collaborator

@EdwardSnyder-NOAA -

Given that the Hera GNU tests have been sitting in the queue for days and that Hera GNU cannot currently be run on Rocky8, a successful Hera Intel run will be enough to get this work merged. Once HPSS has returned to service following maintenance, I will manually run the Jenkins coverage tests on Hera Intel and post the summary in this PR.

There was a failure in the get_from_AWS_ics_GEFS_lbcs_GEFS_fmt_grib2_2022040400_ensemble_2mems WE2E test on Orion that is currently being rerun via the pipeline. Once this test successfully completes and the Rocky8 Hera Intel test is complete, I will move forward with merging this PR.

@MichaelLueken
Collaborator

The Hera Intel coverage WE2E tests were successfully run using Rocky8:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
custom_ESGgrid_Peru_12km_20240312180224                            COMPLETE              18.40
get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_grib2_2019061200_2024031  COMPLETE               6.71
get_from_HPSS_ics_GDAS_lbcs_GDAS_fmt_netcdf_2022040400_ensemble_2  COMPLETE             766.52
get_from_HPSS_ics_HRRR_lbcs_RAP_20240312180228                     COMPLETE              14.39
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2_20240  COMPLETE               6.20
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_plot_20  COMPLETE              12.91
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_RAP_suite_RAP_20240312180232  COMPLETE              10.50
grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_v15p2_20240  COMPLETE               7.02
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2_202403  COMPLETE             233.03
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_20240312  COMPLETE             309.05
grid_RRFS_CONUScompact_3km_ics_HRRR_lbcs_RAP_suite_HRRR_202403121  COMPLETE             330.33
pregen_grid_orog_sfc_climo_20240312180239                          COMPLETE               7.73
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE            1722.79

The Orion tests are still sitting in the queue, so I will hold off on merging this PR until they are complete and I have done a final check with @EdwardSnyder-NOAA.

Collaborator

@MichaelLueken MichaelLueken left a comment

@EdwardSnyder-NOAA -

While I was running the srw_metric_example.sh script, the WE2E test failed to run, which in turn caused the metric test to fail. It looks like RUN_WE2E_OPTi should be replaced with RUN_WE2E_OPT.

@MichaelLueken
Collaborator

The Jenkins tests failed to kick off the WE2E coverage tests on Orion, so I manually ran them and they all passed:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
custom_ESGgrid_SF_1p1km_20240312102903                             COMPLETE             164.23
deactivate_tasks_20240312102905                                    COMPLETE               1.31
get_from_AWS_ics_GEFS_lbcs_GEFS_fmt_grib2_2022040400_ensemble_2me  COMPLETE             758.24
grid_CONUS_3km_GFDLgrid_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta_  COMPLETE             358.74
grid_RRFS_AK_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_plot_20240  COMPLETE             139.42
grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_RRFS_v1beta_202403121  COMPLETE              15.45
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR_20240312102  COMPLETE             379.58
grid_RRFS_CONUScompact_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_  COMPLETE              31.00
grid_RRFS_CONUScompact_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_2  COMPLETE             277.98
grid_SUBCONUS_Ind_3km_ics_FV3GFS_lbcs_FV3GFS_suite_WoFS_v0_202403  COMPLETE              27.61
nco_20240312102917                                                 COMPLETE               7.94
2020_CAD_20240312102919                                            COMPLETE              32.28
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE            2193.78

I'm running one last test on the latest update to the .cicd/scripts/srw_metric_example.sh script and once it passes, I will reapprove and merge this PR.

Collaborator

@MichaelLueken MichaelLueken left a comment

@EdwardSnyder-NOAA -

Thanks for correcting the typos! My retest successfully passed:

Skill Score: 0.99043
+ [[ 0.99043 < 0.700 ]]
+ echo 'Congrats! You pass check!'
Congrats! You pass check!

So I will now reapprove this PR and get it merged.

@MichaelLueken MichaelLueken merged commit 5f461da into ufs-community:develop Mar 14, 2024
2 checks passed