[develop] Round 2 of overhaul to WE2E test suites (and other test improvements!) #732

mkavulich
Collaborator

@mkavulich mkavulich commented Apr 17, 2023

Note to code manager: The label run_we2e_fundamental_tests should be renamed to run_we2e_coverage_tests after this PR is merged.

DESCRIPTION OF CHANGES:

This PR continues the overhaul of WE2E test suites as described in Issue #587 (specifically stage 3 and parts of stage 4 in this comment). The changes are summarized below, roughly in order of importance.

  • "fundamental" tests are replaced by "coverage" test suites. "fundamental" tests are returned to their original purpose: a lightweight set of tests to be run the same on all platforms. "coverage" tests now evenly distribute all comprehensive tests across all platform/compiler combinations for use in Jenkins testing.
  • "comprehensive" test list is updated to include all tests (except current known failures). For platforms that have known failures (for example, HPSS tests on platforms without HPSS access), comprehensive.<platform>[.<compiler>] files are included to automatically run only the tests expected to succeed.
  • Fix several existing failures
    • Use correct date format in grid_SUBCONUS_Ind_3km_ics_FV3GFS_lbcs_FV3GFS_suite_WoFS_v0
    • In config_parser.py, when populating a jinja template, keep dates in string format rather than converting to a datetime object (this fixes problem with get_from_NOMADS_ics_FV3GFS_lbcs_FV3GFS)
    • Fix unit tests for retrieve_data.py: a bug caused all tests to be run in nested subdirectories, which eventually led to failure when running all tests including HPSS retrieval
  • Remove several "get_from_HPSS" tests in favor of new unit tests for HPSS data in test_retrieve_data.yaml
  • Add several more dates and data sources to unit tests in test_retrieve_data.yaml
  • The example config files in the ush/ directory (config.community.yaml and config.nco.yaml) are now included as WE2E tests (symbolically linked in the tests/WE2E/test_configs/default_configs/ directory)
  • Remove long-known failing test grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_v16 (WE2E test "grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_v16" fails with segmentation fault at run_fcst step #359). This is now an old capability with only legacy support (global spectral model was retired in 2019) and there are no immediate plans to fix the bug.
  • WE2E_summary*.txt files are now written to the experiment directory rather than tests/WE2E
  • Updated data_locations.yaml for latest RAP files on HPSS
  • Reduce timeouts and delays between calls to wget to speed up remote data retrieval
  • run_MET_GridStat_vx_APCP tasks fail randomly on occasion; increasing maxtries to 2 mitigates this problem
  • Swap test of restart capability from grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16 to grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v17_p8 for coverage reasons
  • Convert some print_info_msg messages to logging.debug calls to allow suppression of superfluous output if desired
  • A few miscellaneous minor fixes to log messages
  • Made the format of some docstrings more consistent
  • Removed some outdated documentation on validating config.yaml
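
The even distribution of the comprehensive tests across platform/compiler combinations described in the first bullet can be pictured as a round-robin assignment. This is only a rough sketch of that idea, not the procedure actually used to build the coverage suites: the function name `distribute_tests` and the short input lists are hypothetical, chosen for illustration.

```python
from itertools import cycle


def distribute_tests(tests, combos):
    """Assign every test to exactly one platform/compiler combination.

    Hypothetical helper: tests are sorted for reproducibility and then
    dealt out round-robin, so suite sizes differ by at most one.
    """
    suites = {combo: [] for combo in combos}
    for test, combo in zip(sorted(tests), cycle(combos)):
        suites[combo].append(test)
    return suites


# Illustrative inputs only; the real suites contain many more tests.
combos = ["hera.intel", "hera.gnu", "jet.intel"]
tests = [
    "grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v17_p8",
    "grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2",
    "get_from_NOMADS_ics_FV3GFS_lbcs_FV3GFS",
    "nco_grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_timeoffset_suite_GFS_v16",
]
suites = distribute_tests(tests, combos)
for combo, assigned in suites.items():
    print(combo, len(assigned))
```

A plain round-robin ignores test cost and platform restrictions (for example, tests that need HPSS access), so in practice some tests would have to be pinned to specific machines by hand.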

Notes on current test limitations

  1. Coverage tests will result in failures for Hera with intel due to outstanding known failures in NCO mode, as described in issues #688 ("PCP Combine tasks fail in NCO mode") and #652 ("NCO mode (run_envir="nco") results in random failures for WE2E tests"). As inherited from previous "fundamental" testing, the coverage tests for Hera with intel are all run in NCO mode. These should be run with caution until these issues are resolved.
  2. The test grid_RRFS_NA_3km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta is currently a known failure, and so has been removed from the pool of tests for now. This problem is described in issue #731 ("WE2E test grid_RRFS_NA_3km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta fails at the run_fcst step").
  3. Currently some tests are failing on Orion due to a problem with the staged test data. I am working with @natalie-perlin to get this resolved.
  4. Several tests have not yet completed as of the opening of this PR; will update the test section below as needed.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • This change requires a documentation update

TESTS CONDUCTED:

  • hera.intel
    • Had failures that were resolved on re-run
    • Ran test_retrieve_data.py successfully
  • hera.gnu
    • Coverage tests successful
  • orion.intel
    • Coverage tests successful
  • cheyenne.intel
    • Tests in progress
  • cheyenne.gnu
    • Coverage tests successful
  • gaea.intel
  • jet.intel
    • Coverage tests successful
  • wcoss2.intel
  • NOAA Cloud (indicate which platform)
  • Jenkins
  • fundamental test suite
    • The "new" (restored old functionality) fundamental tests were run on Hera (intel) successfully
  • comprehensive tests
    • The new suite was run on Hera successfully

DEPENDENCIES:

None.

DOCUMENTATION:

Documentation for WE2E tests, including the table of test descriptions, has been updated. You can review the built documentation here: https://ufs-srweather-app-mkavulich.readthedocs.io/en/latest/WE2Etests.html

ISSUE:

CHECKLIST

  • My code follows the style guidelines in the Contributor's Guide
  • I have performed a self-review of my own code using the Code Reviewer's Guide
  • I have commented my code, particularly in hard-to-understand areas
  • My changes need updates to the documentation. I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • New and existing tests pass with my changes
  • Any dependent changes have been merged and published

…ush/ directory to list of valid tests via symlinks in new directory test_configs/default_configs/. Also include these new tests in the comprehensive suite.
…enerate_FV3LAM_wflow.py should always output descriptive error messages if invalid config.yaml is provided
 - Fix chdir bug in test_retrieve_data.py
 - Relax timeout and delay times for wget commands in retrieve_data.py
 - Various minor code fixes
 - Add test date for early RAP data with ICS
 - Retrieve RAP 09z out to 45 hours
…like object, keep it as a string. This fixes NOMADS test error (and any other test using the "days_ago" template)
…"wontfix" test grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_v16
…uite to its original purpose (small set of cheap tests to run on any machine)
…ite_GFS_v16 to grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v17_p8
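
The commit note above about keeping a date-like value as a string can be illustrated with a small sketch. This is not the code in config_parser.py: the helper `coerce`, its `keep_dates_as_str` flag, and the `%Y%m%d%H` format are all assumptions, and `str.format` stands in for Jinja so the example has no third-party dependencies. The point is only that converting a compact date string to a datetime object changes how it renders when substituted into a template.

```python
from datetime import datetime


def coerce(value, keep_dates_as_str=False):
    """Hypothetical type-coercion helper for config values.

    By default a string such as "2023041700" becomes a datetime;
    with keep_dates_as_str=True the original string is returned untouched.
    """
    if keep_dates_as_str:
        return value
    try:
        return datetime.strptime(value, "%Y%m%d%H")
    except ValueError:
        return value  # not date-like, leave it alone


template = "cycle={date}"  # stand-in for a Jinja template

converted = template.format(date=coerce("2023041700"))
preserved = template.format(date=coerce("2023041700", keep_dates_as_str=True))

print(converted)  # cycle=2023-04-17 00:00:00
print(preserved)  # cycle=2023041700
```
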
@mkavulich
Collaborator Author

mkavulich commented Apr 19, 2023

@MichaelLueken thanks for keeping me updated. I have pushed a fix to the NOMADS test (not sure why it was including checks for HPSS and AWS as well).

Note that I am not sure if the NOMADS test should work on Gaea (since I cannot test it myself), so if it fails again I can move that test to another machine in the coverage set.

@MichaelLueken MichaelLueken added the jenkins_test label (New label used to test Jenkins sandbox pipeline) Apr 21, 2023
@MichaelLueken
Collaborator

@mkavulich The majority of the Jenkins issues have been addressed. The EPIC role account has also been granted access to HPSS, but we don't currently have rstprod access. This is causing the get_from_HPSS_ics_GSMGFS_lbcs_GSMGFS test to fail in the get_extrn_ics task. I've reached out to the RDHPCS help desk to see if it is possible to add rstprod access to a role account via AIM. If it is, then the EPIC program office will put in the request. My latest run of your PR is using a sandbox pipeline so that your modified Jenkinsfile will be used. If you would like to check on the progress on Hera, the path is /scratch1/NCEPDEV/stmp2/role.epic/jenkins/workspace/line_-_Remote_Jenkinsfile_PR-732.

On other machines, replacing fs-srweather-app_pipeline_PR-732 with line_-_Remote_Jenkinsfile_PR-732 will allow you to check the experiments.

… fix errors in wget due to the built-in wget on Orion being quite old
@mkavulich
Collaborator Author

@MichaelLueken Thanks for kicking off the tests again. Note that there may be some failures on Orion still: I discovered some more static data directories that do not have read permissions. Once that is fixed the tests should succeed.

I also added "wget" to the list of modules to load on Orion; this was a suggestion from the Orion helpdesk to solve some wget errors for the NOMADS test. This shouldn't affect the results since the NOMADS test is not run on Orion for the coverage tests, but I thought I would mention it in case other failures pop up.

@MichaelLueken
Collaborator

@mkavulich The GSMGFS test that is failing due to attempting to pull restricted data is get_from_HPSS_ics_GSMGFS_lbcs_GSMGFS in the Hera GNU coverage suite.

On Gaea, get_from_NOMADS_ics_FV3GFS_lbcs_FV3GFS is still failing. The reason for the failure is:

/bin/sh: wget: command not found

It might be better to replace the get_from_HPSS_ics_GSMGFS_lbcs_GSMGFS test with get_from_NOMADS_ics_FV3GFS_lbcs_FV3GFS
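
The `wget: command not found` failure suggests checking for the executable before attempting a download. The sketch below shows one way to do that while also building a command with short timeouts and retry delays. It is not the logic in retrieve_data.py: the function name, the default values, and the example URL are assumptions, while `--timeout`, `--tries`, and `--waitretry` are standard GNU wget options.

```python
import shutil


def build_wget_command(url, timeout_s=10, tries=2, wait_retry_s=2,
                       which=shutil.which):
    """Return a wget argument list, or None if wget is not on PATH.

    Hypothetical helper; `which` is injectable so the lookup can be
    exercised on machines with or without wget installed.
    """
    exe = which("wget")
    if exe is None:
        return None  # caller can skip the test or fall back to another source
    return [
        exe,
        f"--timeout={timeout_s}",        # per-operation network timeout (seconds)
        f"--tries={tries}",              # total attempts per file
        f"--waitretry={wait_retry_s}",   # maximum wait between retries (seconds)
        url,
    ]


# Simulate a machine without wget, as seen on Gaea.
missing = build_wget_command("https://example.com/file.grib2",
                             which=lambda name: None)
# Simulate a machine where wget is installed at a typical location.
present = build_wget_command("https://example.com/file.grib2",
                             which=lambda name: "/usr/bin/wget")
print(missing)
print(present)
```

The returned list can be passed to `subprocess.run` without a shell, which avoids the `/bin/sh` lookup failure surfacing only at run time.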

…capability on Gaea; remove get_from_HPSS_ics_GSMGFS_lbcs_GSMGFS test due to data restrictions...developers have decided that this legacy capability is not worth fretting over
@mkavulich
Collaborator Author

@MichaelLueken The changes are now in for running different tests (I also included a fix for a bug in the plotting script in #742); unless there is a failure I missed, the only coverage tests that should need re-running prior to merge are Gaea and Hera/GNU.

@MichaelLueken
Collaborator

@mkavulich PR #736 was merged earlier this morning. This PR updated the test_retrieve_data.py unittest script. Please merge the latest develop into your branch to correct the conflict. Thanks!

@MichaelLueken
Collaborator

@mkavulich The latest testing has completed. All tests pass on both Gaea and Hera GNU. A rerun of Cheyenne Intel, however, is still showing two persistent failures (these failures were previously noted, but the cause of the error was blamed on directory naming).

The grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2 test is failing in the run_post_mem000_f00* tasks with the following message:

This post cannot produce IFI icing products because it was not compiled with libIFI.

The Jenkins directory for this experiment is - /glade/scratch/epicufsrt/jenkins/workspace/line_-_Remote_Jenkinsfile_PR-732__2/expt_dirs/grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2.

The nco_grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_timeoffset_suite_GFS_v16 test is failing in get_extrn_ics with the following message:

INFO: Getting file: /glade/p/ral/jntp/UFS_CAM/COMGFS/gfs.20220810/06/gfs.t06z.sfcf006.nc

INFO: File does not exist on disk
 /glade/p/ral/jntp/UFS_CAM/COMGFS/gfs.20220810/06/gfs.t06z.sfcf006.nc
 try using: --input_file_path <your_path>

The Jenkins directory for this experiment is - /glade/scratch/epicufsrt/jenkins/workspace/line_-_Remote_Jenkinsfile_PR-732__2/expt_dirs/nco_grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_timeoffset_suite_GFS_v16.

I can see that there is no gfs.t06z.sfcf006.nc file in the expected directory. Do we need to reach out to the AUS team to see why this file isn't present on Cheyenne?

@MichaelLueken
Collaborator

@mkavulich It looks like moving the grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2 test to Hera should work (Hera includes the new IFI library). I'll test it there and see how it works.

…meoffset_suite_GFS_v16 test, data needs to be staged
…to Hera, intel due to bad libraries problem on Cheyenne
@mkavulich
Collaborator Author

@MichaelLueken I have merged in the latest changes and made changes to the test files so the coverage tests should now succeed.

I cannot figure out why the test_retrieve_data unit tests are failing. They are running successfully (albeit very slowly due to the large file size) on Hera. Do you have any insight on this? Without being able to replicate it I don't know how to solve this problem.

@MichaelLueken
Collaborator

@mkavulich Interestingly, while attempting to run the failing unit test on Hera using your branch, I'm seeing the same failures that I have noted below regarding the UFS-CASE-STUDY ICs and LBCs from AWS.

Looking at the details for the failed Python functional tests, I'm seeing the following:

ERROR: test_ufs_ics_from_aws (test_retrieve_data.FunctionalTesting)
Get UFS-CASE-STUDY ICS from aws
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/runner/work/ufs-srweather-app/ufs-srweather-app/ush/test_retrieve_data.py", line 459, in test_ufs_ics_from_aws
    retrieve_data.main(args)
  File "/home/runner/work/ufs-srweather-app/ufs-srweather-app/ush/retrieve_data.py", line 833, in main
    unavailable = get_requested_files(
  File "/home/runner/work/ufs-srweather-app/ufs-srweather-app/ush/retrieve_data.py", line 361, in get_requested_files
    orig_path = os.getcwd()
FileNotFoundError: [Errno 2] No such file or directory

and

ERROR: test_ufs_lbcs_from_aws (test_retrieve_data.FunctionalTesting)
Get UFS-CASE-STUDY LBCS from aws for 3 hour boundary conditions
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/runner/work/ufs-srweather-app/ufs-srweather-app/ush/test_retrieve_data.py", line 488, in test_ufs_lbcs_from_aws
    retrieve_data.main(args)
  File "/home/runner/work/ufs-srweather-app/ufs-srweather-app/ush/retrieve_data.py", line 833, in main
    unavailable = get_requested_files(
  File "/home/runner/work/ufs-srweather-app/ufs-srweather-app/ush/retrieve_data.py", line 361, in get_requested_files
    orig_path = os.getcwd()
FileNotFoundError: [Errno 2] No such file or directory

It appears as though the unit test is failing due to the inability to find the files that are supposed to be pulled from AWS. Looking at the file naming convention for the CAPE 2020 case study, the data being pulled should be correct:

INFO: Getting files named like ['{yyyymmddhh}.gfs.nemsio.tar.gz'] for ICs and INFO: Getting files named like ['{yyyymmddhh}_bc.atmf{fcst_hr:03d}.nemsio.tar.gz'] for LBCs.

Unfortunately, the DEBUG level prints aren't available in the log for the failed unit test, so there are no messages like:

DEBUG: Looking for fhr = 6

DEBUG: Looking for files like ['gep{mem:02d}.t{hh}z.pgrb2a.0p50.f{fcst_hr:03d}', 'gep{mem:02d}.t{hh}z.pgrb2b.0p50.f{fcst_hr:03d}']

DEBUG: They should be here: ['https://noaa-gefs-pds.s3.amazonaws.com/gefs.{yyyymmdd}/{hh}/atmos/pgrb2ap5', 'https://noaa-gefs-pds.s3.amazonaws.com/gefs.{yyyymmdd}/{hh}/atmos/pgrb2bp5']

At this point, I can only think that the URL for the files that the unit test is attempting to pull doesn't include https://ufs-case-studies.s3.amazonaws.com/, causing the test to fail to find the data.
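
The file-name and location templates quoted in the log messages above use Python format-string fields. As a rough sketch of how such templates expand, assuming plain `str.format` semantics and a hypothetical cycle date (the helper `expand` is illustrative and is not a function in retrieve_data.py):

```python
from datetime import datetime


def expand(template, cycle, fcst_hr=0):
    """Fill the date and forecast-hour fields of a template string.

    Hypothetical helper; unused keyword arguments are ignored by str.format.
    """
    return template.format(
        yyyymmddhh=cycle.strftime("%Y%m%d%H"),
        yyyymmdd=cycle.strftime("%Y%m%d"),
        hh=cycle.strftime("%H"),
        fcst_hr=fcst_hr,
    )


cycle = datetime(2020, 8, 1, 0)  # hypothetical cycle date for illustration

lbc_name = expand("{yyyymmddhh}_bc.atmf{fcst_hr:03d}.nemsio.tar.gz", cycle, fcst_hr=3)
location = expand(
    "https://noaa-gefs-pds.s3.amazonaws.com/gefs.{yyyymmdd}/{hh}/atmos/pgrb2ap5",
    cycle,
)

print(lbc_name)  # 2020080100_bc.atmf003.nemsio.tar.gz
print(location)  # https://noaa-gefs-pds.s3.amazonaws.com/gefs.20200801/00/atmos/pgrb2ap5
```

Printing the expanded name and location side by side is a quick way to check whether the base URL is actually part of what gets requested.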

@MichaelLueken
Collaborator

@mkavulich The unit tests are now passing. I will retest Cheyenne Intel and Hera Intel, then merge this work. Thank you very much!

@mkavulich
Collaborator Author

Okay, I think I figured out what was happening (though I still don't understand it). The new tests didn't have the "chdir" commands requested by @danielabdi-noaa, and for some reason those tests specifically were failing without the chdir commands (despite others working). The problem didn't appear for me because I was running those tests individually rather than as part of the full set of tests. I pushed a change to include those lines and it seems to be working now.
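
For reference, the `FileNotFoundError` raised by `os.getcwd()` in the tracebacks is what Python reports on Linux when the process's current working directory has been deleted. Whether that is exactly what happened in these tests is an assumption; the sketch below only reproduces that general mechanism and shows one way to guard against it. The names `cwd_is_dangling_after_cleanup` and `working_directory` are hypothetical and are not taken from test_retrieve_data.py.

```python
import contextlib
import os
import tempfile


def cwd_is_dangling_after_cleanup() -> bool:
    """Return True if os.getcwd() fails once the current directory is deleted."""
    start = os.getcwd()
    try:
        with tempfile.TemporaryDirectory() as tmp:
            os.chdir(tmp)  # enter the temporary directory and never leave it
        # The directory has now been removed, but the process is still "in" it.
        try:
            os.getcwd()
        except FileNotFoundError:
            return True
        return False
    finally:
        os.chdir(start)  # restore a valid working directory


@contextlib.contextmanager
def working_directory(path):
    """Change into `path` for the duration of the block, then change back."""
    previous = os.getcwd()
    os.chdir(path)
    try:
        yield
    finally:
        os.chdir(previous)


start = os.getcwd()
with tempfile.TemporaryDirectory() as tmp:
    with working_directory(tmp):
        pass  # a test body that writes into tmp would run here
still_valid = os.getcwd() == start
print(still_valid)
```

A failure of this kind depends on what ran earlier in the same process, which would be consistent with tests passing individually but failing as part of the full set.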

Regarding the older failures, the nco_grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_timeoffset_suite_GFS_v16 is a weird one that I had not noticed before: it is meant to be using staged netCDF data from FV3GFS, but that data isn't staged in the EPIC locations. Jet and Hera tests are only succeeding due to the fallback to HPSS retrieval, and on Cheyenne it was still pointing to a legacy DTC space staged location (which was still missing files). I have added those files to the DTC space so it should pass now, but I've removed it from the list so that it can be re-added later with properly staged data.

I have merged in the latest changes and swapped the grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2 test to Hera, so hopefully everything should be passing now.

@MichaelLueken
Collaborator

The unit tests have successfully passed and reruns of Cheyenne Intel and Hera Intel show that the tests pass (with the exceptions of verification tests, outlined in issue #688). Merging this PR to develop now.

@MichaelLueken MichaelLueken merged commit 2035c46 into ufs-community:develop Apr 26, 2023
2 checks passed
@MichaelLueken
Collaborator

Renamed the run_we2e_fundamental_tests label to run_we2e_coverage_tests. Updated the Jenkins automated testing pipeline to pick up PRs with the new run_we2e_coverage_tests label rather than run_we2e_fundamental_tests.

Labels
jenkins_test New label used to test Jenkins sandbox pipeline run_we2e_coverage_tests Run the coverage set of SRW end-to-end tests
Projects
No open projects
Status: Done
3 participants