Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[develop] Integrate Unified Workflow templater tool #793

Merged

Conversation

christinaholtNOAA
Copy link
Collaborator

@christinaholtNOAA christinaholtNOAA commented May 11, 2023

DESCRIPTION OF CHANGES:

This is the first time a Unified Workflow tool is being integrated into a UFS App. We are starting with the templater tool. It's a direct drop-in replacement for fill_jinja_template.py.

This contribution includes the following additional changes:

  • Updates Externals.cfg to include workflow-tools as an external dependency repo (will be updated to point to a hash once the corresponding branch is merged)
  • Updates module files for the workflow and the tasks to set PYTHONPATH to the necessary locations for uwtools.
  • Allows the templater tool to do its own logging and error handling for clarity, instead of SRW callers.
  • Lints the Python code perturbed by this integration in the spirit of stewardship.
  • Adds a .pylintrc to help manage some of the project-wide settings that will be needed by SRW to lint appropriately.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

TESTS CONDUCTED:

  • hera.intel
  • orion.intel
  • cheyenne.intel
  • cheyenne.gnu
  • gaea.intel
  • jet.intel
  • wcoss2.intel
  • NOAA Cloud (indicate which platform)
  • Jenkins
  • fundamental test suite
  • comprehensive tests (specify which if a subset was used)

DEPENDENCIES:

PR 237 in workflow-tools: ufs-community/uwtools#237 (This has been merged and the hash updated in Externals.cfg)
For the linter to pass, we'll also need PR #788 to be merged in.

DOCUMENTATION:

A small update was made.

ISSUE:

CHECKLIST

  • My code follows the style guidelines in the Contributor's Guide
  • I have performed a self-review of my own code using the Code Reviewer's Guide
  • I have commented my code, particularly in hard-to-understand areas
  • My changes need updates to the documentation. I have made corresponding changes to the documentation
  • My changes do not require updates to the documentation (explain).
  • My changes generate no new warnings
  • New and existing tests pass with my changes
  • Any dependent changes have been merged and published

LABELS (optional):

A Code Manager needs to add the following labels to this PR:

  • Work In Progress
  • bug
  • enhancement
  • documentation
  • release
  • high priority
  • run_ci
  • run_we2e_fundamental_tests
  • run_we2e_comprehensive_tests
  • Needs Cheyenne test
  • Needs Jet test
  • Needs Hera test
  • Needs Orion test
  • help wanted

CONTRIBUTORS (optional):

@fgabelmannjr @venitahagerty @WeirAE @elcarpenter @j-derrico

@christinaholtNOAA
Copy link
Collaborator Author

It appears that the functional test is failing because data was removed from AWS. This date only goes out to 17 z files now. Thoughts on how to "fix" this failure?

Switch to a different date that does have data?

@christinaholtNOAA
Copy link
Collaborator Author

@chan-hoo @EdwardSnyder-NOAA Have I addressed your comments sufficiently, or are there still outstanding issues I can work to resolve?

@MichaelLueken
Copy link
Collaborator

@christinaholtNOAA Similar to what was done for the task-based python_srw.lua modulefiles, should the wflow_*.lua modulefiles also be updated to replace regional_workflow with workflow_tools?

@chan-hoo
Copy link
Collaborator

@christinaholtNOAA, I think you missed my comment above:

Traceback (most recent call last):
File "/lfs/h2/emc/lam/noscrub/chan-hoo.jeon/srw_pr_test/ufs-srweather-app/ush/create_aqm_rc_file.py", line 179
, in
create_aqm_rc_file(
File "/lfs/h2/emc/lam/noscrub/chan-hoo.jeon/srw_pr_test/ufs-srweather-app/ush/create_aqm_rc_file.py", line 135
, in create_aqm_rc_file
set_template(
File "/lfs/h2/emc/lam/noscrub/chan-hoo.jeon/srw_pr_test/ufs-srweather-app/ush/python_utils/workflow-tools/scripts/templater.py", line 119, in set_template
cfg = setup_config_obj(user_args, log_name=log.name)
File "/lfs/h2/emc/lam/noscrub/chan-hoo.jeon/srw_pr_test/ufs-srweather-app/ush/python_utils/workflow-tools/scri
pts/templater.py", line 88, in setup_config_obj config_type = cli_helpers.get_file_type(user_args.config_file)
File "/lfs/h2/emc/lam/noscrub/chan-hoo.jeon/srw_pr_test/ufs-srweather-app/ush/python_utils/workflow-tools/src/uwtools/utils/cli_helpers.py", line 33, in get_file_type
raise ValueError(msg)
ValueError: Bad file suffix -- . Cannot determine file type!

@EdwardSnyder-NOAA
Copy link
Collaborator

@chan-hoo @EdwardSnyder-NOAA Have I addressed your comments sufficiently, or are there still outstanding issues I can work to resolve?

Yes, you have. Approving on my end.

@christinaholtNOAA
Copy link
Collaborator Author

@chan-hoo I pushed a change this morning to hopefully address the failure you were seeing.

Copy link
Collaborator

@MichaelLueken MichaelLueken left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@christinaholtNOAA These changes look good to me! I was able to clone your branch on Tier-1 platforms (except for Gaea, as the SRW App won't compile or run following the system upgrade) and run the WE2E fundamental tests.

I did note that loading the regional_workflow conda environment and attempting to run the fundamental tests would cause some of the tests to fail. Loading the workflow_tools conda environment, all tests seem to pass successfully. I would recommend that the modulefiles/wflow_*.lua files be updated to replace regional_workflow with workflow_tools (similar to the change made in ush/load_modules_wflow.sh).

Please note that the grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_GFS_v16 fundamental test is failing on Cheyenne GNU. Running the test on the current HEAD of develop shows that the test is failing there as well, so I won't hold this PR up for this reason (issue #817 has been opened to document this failure).

@christinaholtNOAA
Copy link
Collaborator Author

@MichaelLueken I will try to get to that change shortly. Thanks for the suggestion.

Copy link
Collaborator

@MichaelLueken MichaelLueken left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@christinaholtNOAA In addition to changing the regional_workflow to workflow_tools in the wflow_*.lua modulefiles, please change modulefiles/tasks/jet/run_vx.local.lua to use:

load("python_srw")

The grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_GFS_v16 fundamental test is failing on Jet because the necessary conda environment isn't available (the test is failing because it is unable to find f90nml).

Copy link
Collaborator

@chan-hoo chan-hoo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@christinaholtNOAA, a test run for AQM was completed successfully on wcoss2.

@chan-hoo chan-hoo added Tested on WCOSS2 Successfully tested on WCOSS2 machine and removed Needs WCOSS2 test Testing needs to be run on WCOSS2 machine labels Jun 2, 2023
@MichaelLueken
Copy link
Collaborator

@christinaholtNOAA Given @chan-hoo's approval, I will move forward running the Jenkins tests on this work. The modifications to the wflow_*.lua modulefiles and modulefiles/tasks/jet/run_vx.local.lua file shouldn't affect the automated tests. This will also make sure that the automated tests are properly set up to run with the UW templater too.

@MichaelLueken MichaelLueken added the run_we2e_coverage_tests Run the coverage set of SRW end-to-end tests label Jun 2, 2023
@MichaelLueken
Copy link
Collaborator

@christinaholtNOAA The automated Jenkins tests encountered failures:

Cheyenne GNU - the nco_grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15_thompson_mynn_lam3km test failed (detailed in issue #806).

Gaea - SRW won't build following the latest maintenance on the machine. Will be turning off Gaea testing in #799 until the UFS-WM has merged the new Gaea modulefile changes.

Hera Intel - the grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta test failed (detailed in issue #805).

I saw that the there were two failures in the Hera GNU test - get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_nemsio_2019061200 and get_from_NOMADS_ics_FV3GFS_lbcs_FV3GFS. These tests shouldn't be failing, so I have resubmitted them. I will let you know if they continue to fail.

On Jet, the grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta test failed in run_post_mem000_f003 with the message:

forrtl: severe (41): insufficient virtual memory

I will resubmit the Jet tests and hopefully this test will pass.

I will also run the Jenkins tests manually on Orion, since the machine requires git/2.28.0 to be loaded, but since git/2.28.0 isn't available on Hercules and Hercules and Orion are the same machine, the git/2.28.0 module load is constantly removed from the role-epic-ps account's .bashrc file. I will let you know if there are any failures.

@MichaelLueken
Copy link
Collaborator

@christinaholtNOAA The manual testing of the Jenkins coverage tests on Orion successfully passed. However, the same test that failed on Jet, grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta, has failed again. This time, it has failed in run_post_mem000_f001 with the following message:

srun: error: k4: task 13: Killed
srun: launch/slurm: _step_signal: Terminating StepId=28933725.0
slurmstepd: error: *** STEP 28933725.0 ON k4 CANCELLED AT 2023-06-05T15:36:57 ***

The test can be found - /mnt/lfs4/HFIP/hfv3gfs/role.epic/jenkins/workspace/fs-srweather-app_pipeline_PR-793/expt_dirs/grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta.

@christinaholtNOAA
Copy link
Collaborator Author

@MichaelLueken I saw that the test that was failing on Jet yesterday now seems to be passing. I also saw that running that manual test on Jet passed for me yesterday.

Is there any update on the status of the other failures? It seems we may want to look more into random failures for the post if we keep seeing those creep in.

@MichaelLueken
Copy link
Collaborator

@christinaholtNOAA The failures are on Gaea (the SRW App won't compile following the maintenance on the machine that changed the available libraries on the machine), Orion (the necessary git/2.28.0 module isn't being loaded in the role-epic-ps account, causing the Jenkins pipeline to fail to clone the CCPP), and an issue on Hera GNU (the get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_nemsio_2019061200 test is failing). I have manually run the Orion tests and all tests passed. I will rerun the Hera GNU test that failed, then move forward with merging once it has passed.

@MichaelLueken
Copy link
Collaborator

@christinaholtNOAA The Hera GNU tests successfully completed during the rerun, so I will now move forward with merging this PR.

Additionally, it is important to note that the grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta test successfully passed with this PR. I won't close issue #805 right away, but will perform additional testing before closing the issue as resolved. The nco_grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15_thompson_mynn_lam3km is still failing in this PR. So, that issue is still open. I haven't been looking into these failures either, I've been focusing on fixing the issues with the updated WM hashes.

@MichaelLueken MichaelLueken merged commit 8b382cc into ufs-community:develop Jun 9, 2023
3 of 5 checks passed
@MichaelLueken MichaelLueken mentioned this pull request Jun 15, 2023
25 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
run_we2e_coverage_tests Run the coverage set of SRW end-to-end tests Tested on WCOSS2 Successfully tested on WCOSS2 machine
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants