[develop] Round 2 of overhaul to WE2E test suites (and other test improvements!) #732

mkavulich · 2023-04-17T06:27:51Z

Note to code manager: The label run_we2e_fundamental_tests should be renamed to run_we2e_coverage_tests after this PR is merged.

DESCRIPTION OF CHANGES:

This PR continues the overhaul of WE2E test suites as described in Issue #587 (specifically stage 3 and parts of stage 4 in this comment). The changes are summarized below, roughly in order of importance.

"fundamental" tests are replaced by "coverage" test suites. "fundamental" tests are returned to their original purpose: a lightweight set of tests to be run the same on all platforms. "coverage" tests now evenly distribute all comprehensive tests across all platform/compiler combinations for use in Jenkins testing.
"comprehensive" test list is updated to include all tests (except current known failures). For platforms that have known failures (for example, HPSS tests on platforms without HPSS access), comprehensive.<platform>[.<compiler>] files are included to automatically run only the tests expected to succeed.
Fix several existing failures
- Use correct date format in grid_SUBCONUS_Ind_3km_ics_FV3GFS_lbcs_FV3GFS_suite_WoFS_v0
- In config_parser.py, when populating a jinja template, keep dates in string format rather than converting to a datetime object (this fixes problem with get_from_NOMADS_ics_FV3GFS_lbcs_FV3GFS)
- Fix unit tests for retrieve_data.py: a bug caused all tests to be run in nested subdirectories, which eventually led to failure when running all tests, including HPSS retrieval
Remove several "get_from_HPSS" tests in favor of new unit tests for HPSS data in test_retrieve_data.yaml
Add several more dates and data sources to unit tests in test_retrieve_data.yaml
The example config files in the ush/ directory (config.community.yaml and config.nco.yaml) are now included as WE2E tests (symbolically linked in the tests/WE2E/test_configs/default_configs/ directory)
Remove long-known failing test grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_v16 (WE2E test "grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_v16" fails with segmentation fault at run_fcst step #359). This is now an old capability with only legacy support (global spectral model was retired in 2019) and there are no immediate plans to fix the bug.
WE2E_summary*.txt files are now written to the experiment directory rather than tests/WE2E
Updated data_locations.yaml for latest RAP files on HPSS
Reduce timeouts and delays between calls to wget to speed up remote data retrieval
run_MET_GridStat_vx_APCP tasks fail randomly on occasion; increasing maxtries to 2 mitigates this problem
Swap test of restart capability from grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16 to grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v17_p8 for coverage reasons
Convert some print_info_msg messages to logging.debug calls to allow suppression of superfluous output if desired
A few miscellaneous minor fixes to log messages
Made the format of some docstrings more consistent
Removed some outdated documentation on validating config.yaml
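
The per-platform lists described above could be resolved with a lookup like the following. This is only a sketch, not the logic in the WE2E scripts: the function name `pick_suite_file`, the search order (most specific name first), and the lack of a file extension are all assumptions made for illustration. Only the `comprehensive.<platform>[.<compiler>]` naming pattern comes from this PR.

```python
from typing import Iterable, Optional


def pick_suite_file(suite: str, platform: str, compiler: Optional[str],
                    available: Iterable[str]) -> str:
    """Return the most specific test-list name present in `available`.

    Hypothetical helper: tries <suite>.<platform>.<compiler>, then
    <suite>.<platform>, then falls back to the generic <suite> list.
    """
    names = set(available)
    candidates = []
    if compiler:
        candidates.append(f"{suite}.{platform}.{compiler}")
    candidates.append(f"{suite}.{platform}")
    for name in candidates:
        if name in names:
            return name
    # No platform-specific list exists, so the full suite is used.
    return suite


if __name__ == "__main__":
    # Hypothetical set of list files, for illustration only.
    files = ["comprehensive", "comprehensive.orion", "comprehensive.cheyenne.gnu"]
    print(pick_suite_file("comprehensive", "cheyenne", "gnu", files))
    print(pick_suite_file("comprehensive", "orion", "intel", files))
    print(pick_suite_file("comprehensive", "hera", "intel", files))
```

Passing the available names in as an argument keeps the sketch independent of any real directory layout.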

Notes on current test limitations

Coverage tests will result in failures for Hera with intel due to outstanding known failures in NCO mode, as described in issues #688 ("PCP Combine tasks fail in NCO mode") and #652 ("NCO mode (run_envir="nco") results in random failures for WE2E tests"). As inherited from the previous "fundamental" testing, the coverage tests for Hera with intel are all run in NCO mode. These should be run with caution until these issues are resolved.
The test grid_RRFS_NA_3km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta is currently a known failure, and so has been removed from the pool of tests for now. This problem is described in issue #731 ("WE2E test grid_RRFS_NA_3km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta fails at the run_fcst step").
Currently some tests are failing on Orion due to a problem with the staged test data. I am working with @natalie-perlin to get this resolved.
Several tests have not yet completed as of the opening of this PR; will update the test section below as needed.

Type of change

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
This change requires a documentation update

TESTS CONDUCTED:

DEPENDENCIES:

None.

DOCUMENTATION:

Documentation for WE2E tests, including the table of test descriptions, has been updated. You can review the built documentation here: https://ufs-srweather-app-mkavulich.readthedocs.io/en/latest/WE2Etests.html

ISSUE:

Resolves #727 (Unit tests for retrieve_data.py will not run successfully on Jet, Hera)
Resolves #705 (run_post fails for NA_3km domain)
Resolves #521 (Latest HPSS RAP files are not supported)
Resolves #347 (WE2E test "GST_release_public_v1" fails on most platforms due to running out of wallclock time)
Resolves #742 (Plotting tasks do not run the correct script; plots are not created.)
Part of #587 (Overhaul and consolidate WE2E tests, identify needed additional tests), though I will keep that issue open as there are still some things that need to be implemented (I may consolidate the remaining work into a new issue for clarity).

CHECKLIST

My code follows the style guidelines in the Contributor's Guide
I have performed a self-review of my own code using the Code Reviewer's Guide
I have commented my code, particularly in hard-to-understand areas
My changes need updates to the documentation. I have made corresponding changes to the documentation
My changes generate no new warnings
New and existing tests pass with my changes
Any dependent changes have been merged and published

mkavulich · 2023-04-19T20:16:33Z

@MichaelLueken thanks for keeping me updated. I have pushed a fix to the NOMADS test (not sure why it was including checks for HPSS and AWS as well).

Note that I am not sure if the NOMADS test should work on Gaea (since I can not test it myself), so if it fails again I can move that test to another machine in the coverage set.

MichaelLueken · 2023-04-21T15:18:56Z

@mkavulich The majority of the Jenkins issues have been addressed. The EPIC role account has also been granted access to HPSS, but we don't currently have rstprod access. This is causing the get_from_HPSS_ics_GSMGFS_lbcs_GSMGFS test to fail in the get_extrn_ics task. I've reached out to the RDHPCS help desk to see if it is possible to add rstprod access to a role account via AIM. If it is, then the EPIC program office will put in the request. My latest run of your PR is using a sandbox pipeline so that your modified Jenkinsfile will be used. If you would like to check on the progress on Hera, the path is /scratch1/NCEPDEV/stmp2/role.epic/jenkins/workspace/line_-_Remote_Jenkinsfile_PR-732.

On other machines, replacing fs-srweather-app_pipeline_PR-732 with line_-_Remote_Jenkinsfile_PR-732 will allow you to check the experiments.

mkavulich · 2023-04-21T18:09:12Z

@MichaelLueken Thanks for kicking off the tests again. Note that there may be some failures on Orion still: I discovered some more static data directories that do not have read permissions. Once that is fixed the tests should succeed.

I also added "wget" to the list of modules to load on Orion; this was a suggestion from the Orion helpdesk to solve some wget errors for the NOMADS test. This shouldn't affect the results since the NOMADS test is not run on Orion for the coverage tests, but I thought I would mention it in case other failures pop up.

MichaelLueken · 2023-04-21T18:13:45Z

@mkavulich The GSMGFS test that is failing due to attempting to pull restricted data is get_from_HPSS_ics_GSMGFS_lbcs_GSMGFS in the Hera GNU coverage suite.

On Gaea, get_from_NOMADS_ics_FV3GFS_lbcs_FV3GFS is still failing. The reason for the failure is:

/bin/sh: wget: command not found

It might be better to replace the get_from_HPSS_ics_GSMGFS_lbcs_GSMGFS test with get_from_NOMADS_ics_FV3GFS_lbcs_FV3GFS
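
A failure like `wget: command not found` can be detected before any download is attempted. The sketch below is an assumption about how such a guard might look, not code from retrieve_data.py; `require_tool` is a hypothetical name.

```python
import shutil


def require_tool(name: str) -> str:
    """Return the full path of an executable, or raise a clear error.

    Hypothetical guard: shutil.which searches the directories on PATH,
    so a missing wget is reported up front instead of by /bin/sh.
    """
    path = shutil.which(name)
    if path is None:
        raise RuntimeError(
            f"{name} not found on PATH; load a module that provides it "
            "or use a different data store"
        )
    return path


if __name__ == "__main__":
    try:
        print(require_tool("wget"))
    except RuntimeError as err:
        print(f"Skipping remote retrieval: {err}")
```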

mkavulich · 2023-04-24T19:23:11Z

@MichaelLueken The changes are now in for running different tests (I also included a fix for a bug in the plotting script in #742); unless there is a failure I missed, the only coverage tests that should need re-running prior to merge are Gaea and Hera/GNU.

MichaelLueken · 2023-04-25T13:58:21Z

@mkavulich PR #736 was merged earlier this morning. This PR updated the test_retrieve_data.py unittest script. Please merge the latest develop into your branch to correct the conflict. Thanks!

MichaelLueken · 2023-04-25T18:38:28Z

@mkavulich The latest testing has completed. All tests pass on both Gaea and Hera GNU. A rerun of Cheyenne Intel, however, is still showing two persistent failures (these failures were previously noted, but the cause of the error was blamed on directory naming).

The grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2 test is failing in the run_post_mem000_f00* tasks with the following message:

This post cannot produce IFI icing products because it was not compiled with libIFI.

The Jenkins directory for this experiment is - /glade/scratch/epicufsrt/jenkins/workspace/line_-_Remote_Jenkinsfile_PR-732__2/expt_dirs/grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2.

The nco_grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_timeoffset_suite_GFS_v16 test is failing in get_extrn_ics with the following message:

INFO: Getting file: /glade/p/ral/jntp/UFS_CAM/COMGFS/gfs.20220810/06/gfs.t06z.sfcf006.nc

INFO: File does not exist on disk
 /glade/p/ral/jntp/UFS_CAM/COMGFS/gfs.20220810/06/gfs.t06z.sfcf006.nc
 try using: --input_file_path <your_path>

The Jenkins directory for this experiment is - /glade/scratch/epicufsrt/jenkins/workspace/line_-_Remote_Jenkinsfile_PR-732__2/expt_dirs/nco_grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_timeoffset_suite_GFS_v16.

I can see that there is no gfs.t06z.sfcf006.nc file in the expected directory. Do we need to reach out to the AUS team to see why this file isn't present on Cheyenne?

MichaelLueken · 2023-04-25T18:50:19Z

@mkavulich It looks like moving the grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2 test to Hera should work (Hera includes the new IFI library). I'll test it there and see how it works.

mkavulich · 2023-04-26T06:30:33Z

@MichaelLueken I have merged in the latest changes and made changes to the test files so the coverage tests should now succeed.

I can not figure out why the test_retrieve_data unit tests are failing. They are running successfully (albeit very slowly due to the large file size) on Hera. Do you have any insight on this? Without being able to replicate it I don't know how to solve this problem.

MichaelLueken · 2023-04-26T14:46:59Z

@mkavulich Interestingly, while attempting to run the failing unit test on Hera using your branch, I'm seeing the same failures that I have noted below regarding the UFS-CASE-STUDY ICs and LBCs from AWS.

Looking at the details for the failed Python functional tests, I'm seeing the following:

ERROR: test_ufs_ics_from_aws (test_retrieve_data.FunctionalTesting)
Get UFS-CASE-STUDY ICS from aws
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/runner/work/ufs-srweather-app/ufs-srweather-app/ush/test_retrieve_data.py", line 459, in test_ufs_ics_from_aws
    retrieve_data.main(args)
  File "/home/runner/work/ufs-srweather-app/ufs-srweather-app/ush/retrieve_data.py", line 833, in main
    unavailable = get_requested_files(
  File "/home/runner/work/ufs-srweather-app/ufs-srweather-app/ush/retrieve_data.py", line 361, in get_requested_files
    orig_path = os.getcwd()
FileNotFoundError: [Errno 2] No such file or directory

and

ERROR: test_ufs_lbcs_from_aws (test_retrieve_data.FunctionalTesting)
Get UFS-CASE-STUDY LBCS from aws for 3 hour boundary conditions
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/runner/work/ufs-srweather-app/ufs-srweather-app/ush/test_retrieve_data.py", line 488, in test_ufs_lbcs_from_aws
    retrieve_data.main(args)
  File "/home/runner/work/ufs-srweather-app/ufs-srweather-app/ush/retrieve_data.py", line 833, in main
    unavailable = get_requested_files(
  File "/home/runner/work/ufs-srweather-app/ufs-srweather-app/ush/retrieve_data.py", line 361, in get_requested_files
    orig_path = os.getcwd()
FileNotFoundError: [Errno 2] No such file or directory

It appears as though the unit test is failing due to the inability to find the files that are supposed to be pulled from AWS. Looking at the file naming convention for the CAPE 2020 case study, the data being pulled should be correct:

INFO: Getting files named like ['{yyyymmddhh}.gfs.nemsio.tar.gz'] for ICs and INFO: Getting files named like ['{yyyymmddhh}_bc.atmf{fcst_hr:03d}.nemsio.tar.gz'] for LBCs.
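
The placeholders in these patterns look like Python format-string fields, so it is easy to check what a given template expands to. A small sketch, assuming plain `str.format` semantics and an arbitrary cycle date chosen for illustration (not necessarily the one the unit test uses):

```python
from datetime import datetime

# Templates copied from the INFO messages above.
ICS_TEMPLATE = "{yyyymmddhh}.gfs.nemsio.tar.gz"
LBCS_TEMPLATE = "{yyyymmddhh}_bc.atmf{fcst_hr:03d}.nemsio.tar.gz"


def expand(template: str, cycle: datetime, fcst_hr: int = 0) -> str:
    """Fill a file-name template for one cycle and forecast hour."""
    return template.format(
        yyyymmddhh=cycle.strftime("%Y%m%d%H"),
        fcst_hr=fcst_hr,  # ignored by templates that do not use it
    )


if __name__ == "__main__":
    cycle = datetime(2020, 7, 23, 0)  # illustrative date only
    print(expand(ICS_TEMPLATE, cycle))
    for hr in (3, 6):
        print(expand(LBCS_TEMPLATE, cycle, hr))
```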

Unfortunately, the DEBUG level prints aren't available in the log for the failed unit test, so there are no messages like:

DEBUG: Looking for fhr = 6 
 
DEBUG: Looking for files like ['gep{mem:02d}.t{hh}z.pgrb2a.0p50.f{fcst_hr:03d}', 'gep{mem:02d}.t{hh}z.pgrb2b.0p50.f{fcst_hr:03d}'] 
 
DEBUG: They should be here: ['https://noaa-gefs-pds.s3.amazonaws.com/gefs.{yyyymmdd}/{hh}/atmos/pgrb2ap5', 'https://noaa-gefs-pds.s3.amazonaws.com/gefs.{yyyymmdd}/{hh}/atmos/pgrb2bp5']

At this point, I can only think that the URLs the unit test is attempting to pull from don't include https://ufs-case-studies.s3.amazonaws.com/, causing the test to fail to find the data.
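
Whether DEBUG lines like those appear depends only on the configured log level. A minimal sketch with the standard `logging` module; the logger name and the helper `run_with_level` are illustrative and not taken from retrieve_data.py:

```python
import io
import logging


def run_with_level(level: int) -> str:
    """Emit one INFO and one DEBUG record and return what was written."""
    stream = io.StringIO()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s: %(message)s"))

    logger = logging.getLogger("retrieve_data_demo")  # hypothetical name
    logger.handlers = [handler]   # replace handlers left by earlier calls
    logger.propagate = False      # keep output out of the root logger
    logger.setLevel(level)

    logger.info("Getting files named like %s", ["{yyyymmddhh}.gfs.nemsio.tar.gz"])
    logger.debug("Looking for fhr = %d", 6)
    return stream.getvalue()


if __name__ == "__main__":
    print(run_with_level(logging.INFO))   # the DEBUG record is dropped
    print(run_with_level(logging.DEBUG))  # both records appear
```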

MichaelLueken · 2023-04-26T17:02:25Z

@mkavulich The unit tests are now passing. I will retest Cheyenne Intel and Hera Intel, then merge this work. Thank you very much!

mkavulich · 2023-04-26T18:13:08Z

Okay, I think I figured out what was happening (though I still don't understand it). The new tests didn't have the "chdir" commands requested by @danielabdi-noaa, and for some reason those tests specifically were failing without the chdir commands (despite others working). The problem didn't appear for me because I was running those tests individually rather than as part of the full set of tests. I pushed a change to include those lines and it seems to be working now.
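
One explanation consistent with the tracebacks and with the tests passing when run individually: on Linux, `os.getcwd()` raises `FileNotFoundError` when the process's working directory has been deleted, so a test that does not chdir into its own temporary directory can inherit a removed one from an earlier test. The sketch below reproduces that mechanism. It is an illustration under that assumption, not the change made to test_retrieve_data.py (which adds chdir calls to the new tests); `in_directory`, `leaky_test`, and `safe_test` are hypothetical names.

```python
import contextlib
import os
import tempfile


@contextlib.contextmanager
def in_directory(path):
    """Temporarily chdir into `path`, always returning to the old cwd."""
    previous = os.getcwd()
    os.chdir(path)
    try:
        yield
    finally:
        os.chdir(previous)


def leaky_test():
    """Leaves the process inside a temp dir that is deleted on exit."""
    with tempfile.TemporaryDirectory() as tmp:
        os.chdir(tmp)  # never restored
    # tmp is gone now, so the next os.getcwd() fails on Linux


def safe_test():
    """Same shape, but the cwd is restored before the temp dir is removed."""
    with tempfile.TemporaryDirectory() as tmp:
        with in_directory(tmp):
            pass  # the code under test would run here


if __name__ == "__main__":
    start = os.getcwd()
    safe_test()
    assert os.getcwd() == start

    leaky_test()
    try:
        os.getcwd()
    except FileNotFoundError as err:
        print("a later test would fail with:", err)
    finally:
        os.chdir(start)  # put the process back in a valid directory
```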

Regarding the older failures, the nco_grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_timeoffset_suite_GFS_v16 is a weird one that I had not noticed before: it is meant to be using staged netCDF data from FV3GFS, but that data isn't staged in the Epic locations. Jet and Hera tests are only succeeding due to the fallback to HPSS retrieval, and on Cheyenne it was still pointing to a legacy DTC space staged location (which was still missing files). I have added those files to the DTC space so it should pass now, but I've removed it from the list so that it can be re-added later with properly staged data.
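
The situation described here, where a test passes only because retrieval falls through to a later data store, can be pictured as an ordered search. A rough sketch; `first_store_with_file` and the store callables are hypothetical stand-ins, not the internals of retrieve_data.py:

```python
from typing import Callable, List, Optional, Tuple


def first_store_with_file(
    filename: str,
    stores: List[Tuple[str, Callable[[str], bool]]],
) -> Optional[str]:
    """Try each data store in order; return the name of the first that has the file."""
    for name, has_file in stores:
        if has_file(filename):
            return name
    return None


if __name__ == "__main__":
    staged = set()                      # nothing staged on disk
    archive = {"gfs.t06z.sfcf006.nc"}   # stand-in for an archive such as HPSS

    stores = [
        ("disk", lambda f: f in staged),
        ("hpss", lambda f: f in archive),
    ]
    # With archive access, the fallback hides the missing staged file.
    print(first_store_with_file("gfs.t06z.sfcf006.nc", stores))
    # Without it, only "disk" is searched and the request fails.
    print(first_store_with_file("gfs.t06z.sfcf006.nc", stores[:1]))
```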

I have merged in the latest changes and swapped the grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2 test to Hera, so hopefully everything should be passing now.

MichaelLueken · 2023-04-26T19:34:33Z

The unit tests have successfully passed and reruns of Cheyenne Intel and Hera Intel show that the tests pass (with the exceptions of verification tests, outlined in issue #688). Merging this PR to develop now.

MichaelLueken · 2023-04-26T19:43:42Z

Renamed the run_we2e_fundamental_tests label to run_we2e_coverage_tests. Updated the Jenkins automated testing pipeline to pick up PRs with the new run_we2e_coverage_tests label rather than run_we2e_fundamental_tests.

mkavulich added 21 commits April 16, 2023 23:11

Add example files 'config.community.yaml' and 'config.nco.yaml' from …

a4f3b1f

…ush/ directory to list of valid tests via symlinks in new directory test_configs/default_configs/. Also include these new tests in the comprehensive suite.

Write WE2E summary file to experiment directory instead of current di…

374ee1f

…rectory, standardize function docstrings

Remove references to validating config.yaml with ./config_utils.py; g…

896765a

…enerate_FV3LAM_wflow.py should always output descriptive error messages if invalid config.yaml is provided

- Convert several WE2E tests (retrieving data from HPSS) to unit tests

afe19ad

- Fix chdir bug in test_retrieve_data.py - Relax timeout and delay times for wget commands in retrieve_data.py - Various minor code fixes

Update latest RAP hpss filenames, add unit tests for RAP hpss dates

efa4a79

- Cut down on intermediate test dates for retrieving data

fedc647

- Add test date for early RAP data with ICS - Retrieve RAP 09z out to 45 hours

Still dealing with random failures for GridStat, bump up "maxtries" to 2

e15b8ae

Fix incorrect date format for grid_SUBCONUS_Ind_3km_ics_FV3GFS_lbcs_F…

5d4db18

…V3GFS_suite_WoFS_v0 test

Fix a few unit test errors for retrieve_data.py, all now pass

ce53e21

For extend_yaml, when rendering a jinja template that returns a date-…

d1b497c

…like object, keep it as a string. This fixes NOMADS test error (and any other test using the "days_ago" template)

Don't overwrite existing walltime for re-run test

4681715

Update comprehensive test suite to include all working tests, remove …

96beedb

…"wontfix" test grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_v16

Rename "fundamental" tests to "coverage", revert "fundamental" test s…

49df27c

…uite to its original purpose (small set of cheap tests to run on any machine)

Skip any lines in read file that have comment character

bcbdf99

Swap restart test from grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_su…

b1edc67

…ite_GFS_v16 to grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v17_p8

Implement new, wider-coverage fundamental test suite. Still comes in …

8d30eac

…under 100 core hours

Add back NCO tests that are now working

3b3c9c8

Remove grid_RRFS_NA_3km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta test…

85ce577

… for now, it is not working; see issue 731

Distribute all comprehensive tests among all platforms via "coverage"…

f87eb1a

… tests

Increase grid_RRFS_NA_3km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta ru…

2f32728

…n_post tasks to use 12 nodes to resolve ufs-community#705

Up run_fcst walltime to 2 hours for GST_release_public_v1 to prevent …

8c96f6c

…occasional walltime-related failures

mkavulich self-assigned this Apr 17, 2023

mkavulich requested review from gsketefian, JeffBeck-NOAA, RatkoVasic-NOAA, BenjaminBlake-NOAA, ywangwof, chan-hoo, panll and christinaholtNOAA as code owners April 17, 2023 06:27

Restore "chdir" to tmpdir in test_retrieve_data.py per PR review comm…

02a2b8f

…ents

Fix date error in FV3GFS netcdf test, add pylint suggestions (now lin…

28ae19d

…ts 10/10)

MichaelLueken added the jenkins_test New label used to test Jenkins sandbox pipeline label Apr 21, 2023

Load wget module on Orion; this is a suggestion from Orion support to…

f12d769

… fix errors in wget due to the built-in wget on Orion being quite old

mkavulich added 2 commits April 24, 2023 18:06

Move NOMADS test from Gaea to Hera GNU due to unavailability of that …

431c510

…capability on Gaea; remove get_from_HPSS_ics_GSMGFS_lbcs_GSMGFS test due to data restrictions...developers have decided that this legacy capability is not worth fretting over

Fix plotting task workflow yaml to run correct script

b5abac6

danielabdi-noaa approved these changes Apr 24, 2023

Merge remote-tracking branch 'origin/develop' into feature/WE2E_test_…

cbf5a03

…improvement_round_2

mkavulich mentioned this pull request Apr 26, 2023

Overhaul and consolidate WE2E tests, identify needed additional tests #587

Closed

mkavulich added 2 commits April 26, 2023 06:08

Temporarily remove nco_grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_ti…

d70a47a

…meoffset_suite_GFS_v16 test, data needs to be staged

Move grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2 test …

c9799f1

…to Hera, intel due to bad libraries problem on Cheyenne

Trying to solve server-side CI problems errors (local tests are passing)

46c75ef

MichaelLueken merged commit 2035c46 into ufs-community:develop Apr 26, 2023
2 checks passed

This was referenced May 2, 2023

[develop] Add precipitation-type verification for MET #757

Merged

[develop] Remove redundancies in loading the run-time python environment. #761

Merged

MichaelLueken mentioned this pull request May 16, 2023

[develop] Update WM and UPP hashes and minor rearrangement of WE2E coverage tests that fail on certain platforms #799

Merged
