
[develop] cicd functional testing script #707

Merged

Conversation

BruceKropp-Raytheon
Collaborator

DESCRIPTION OF CHANGES:

This script is expected to run functional tests within the SRW application on any of the supported platforms.
The goal is to perform some basic setup and execute the initial few workflow tasks,
as described in the user documentation section "Run the Workflow Using Stand-Alone Scripts".
This would follow a normal SRW build, in an attempt to exercise environment setup, modules,
data sets, and workflow scripts, without using too much time or too many account resources.
The hope is to catch any snags that might prevent follow-up WE2E fundamental testing.
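As a rough sketch, the flow is intended to look something like the following (command and variable names here are assumptions drawn from this thread and the stand-alone-scripts documentation; the actual logic lives in .cicd/scripts/srw_ftest.sh):

# Sketch only -- not the actual script
source ${workspace}/ush/load_modules_wflow.sh ${platform,,}              # environment setup (modules + conda)
cd ${workspace}/ush
cp config.community.yaml config.yaml                                     # derive a test config, per the user docs
./config_utils.py -c $(pwd)/config.yaml -v $(pwd)/config_defaults.yaml   # consistency check
./generate_FV3LAM_wflow.py                                               # create the experiment directory
cd ${EXPTDIR}
./run_make_grid.sh && ./run_make_orog.sh                                 # run the initial stand-alone wrapper tasks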

Type of change

  • New feature (non-breaking change which adds functionality)

TESTS CONDUCTED:

This script is being made available in .cicd/scripts/ for future use as an intermediary functional-test gate after the CI build stage.

  • hera.intel
  • orion.intel
  • cheyenne.intel
  • cheyenne.gnu
  • gaea.intel
  • jet.intel
  • wcoss2.intel
  • NOAA Cloud (indicate which platform)
  • Jenkins
  • fundamental test suite
  • comprehensive tests (specify which if a subset was used)

DEPENDENCIES:

DOCUMENTATION:

ISSUE:

CHECKLIST

  • My code follows the style guidelines in the Contributor's Guide
  • I have performed a self-review of my own code using the Code Reviewer's Guide
  • I have commented my code, particularly in hard-to-understand areas
  • My changes need updates to the documentation. I have made corresponding changes to the documentation
  • My changes do not require updates to the documentation (explain).
  • My changes generate no new warnings
  • New and existing tests pass with my changes
  • Any dependent changes have been merged and published

LABELS (optional):

A Code Manager needs to add the following labels to this PR:

  • Work In Progress
  • enhancement
  • documentation
  • run_ci
  • run_we2e_fundamental_tests
  • run_we2e_comprehensive_tests
  • Needs Cheyenne test
  • Needs Jet test
  • Needs Hera test
  • Needs Orion test

CONTRIBUTORS (optional):

BruceKropp-Raytheon and others added 4 commits March 27, 2023 14:35
Signed-off-by: Bruce Kropp <Bruce.Kropp@Raytheon.com>
Signed-off-by: Bruce Kropp <Bruce.Kropp@Raytheon.com>
Some platforms are missing python3 or other support for modules and conda, but this does not necessarily harm the workflow tests.
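A minimal sketch of the kind of tolerance this implies, assuming a FORGIVE_CONDA switch like the one mentioned later in this thread (the script's actual handling may differ):

# Sketch: tolerate platforms where the regional_workflow conda environment cannot be activated
if ! conda activate regional_workflow 2>/dev/null ; then
  if [[ "${FORGIVE_CONDA:-true}" == true ]]; then
    echo "WARNING: could not activate regional_workflow; continuing anyway"
  else
    exit 1
  fi
fi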
@MichaelLueken left a comment

@BruceKropp-Raytheon I was able to make this script work within the current Jenkins framework on Orion with the changes that I noted in my review. Please note that the make_orog task is failing:
./run_make_orog.sh ... FAIL rc=1

The log file shows:
/work/noaa/epic-ps/mlueken/ufs-srweather-app/install_intel/exec/orog: error while loading shared libraries: libirng.so: cannot open shared object file: No such file or directory
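For anyone hitting the same failure, a hedged diagnostic (the path is taken from the log above; the fix discussed further down is to keep the build modulefile loaded so the Intel runtime stays on LD_LIBRARY_PATH):

# List the shared libraries the orog executable cannot resolve
ldd /work/noaa/epic-ps/mlueken/ufs-srweather-app/install_intel/exec/orog | grep "not found"
# libirng.so ships with the Intel compiler runtime; module load build_${platform,,}_${SRW_COMPILER} restores it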

Comment on lines 71 to 79
./config_utils.py -c $(pwd)/config.yaml -v $(pwd)/config_defaults.yaml
cd ${workspace}

# Activate the workflow environment ...
source etc/lmod-setup.sh ${platform,,}
module use modulefiles
module load build_${platform,,}_${SRW_COMPILER}
module load wflow_${platform,,}
conda activate regional_workflow
Collaborator

The config_utils.py script requires python3 to be available before being called. I moved the activation of the workflow environment to before line 68 and all that you should need to load the environment is:
source ${workspace}/ush/load_modules_wflow.sh ${platform,,}

Collaborator Author

Great hint. Probably the documentation needs to be updated to reflect this better approach.

Collaborator

Replacing the workflow environment with:
source ${workspace}/ush/load_modules_wflow.sh ${platform,,}
was necessary to keep the script from erroring out for me due to attempting to activate the regional_workflow conda environment.

Further testing showed that this replacement is why the run_orog task was failing (the build modulefile is purged when the load_modules_wflow.sh script is called). I was able to create a new script, ush/load_modules_ftest.sh, that loads the build modulefile and the wflow modulefile, then activates regional_workflow. With this script, I was able to successfully run the run_orog wrapper script, and the test passes.
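A sketch of what such a script might contain, pieced together from the snippet quoted earlier and this description (the actual ush/load_modules_ftest.sh may differ):

# Sketch: combined environment load for the functional test
source ${workspace}/etc/lmod-setup.sh ${platform,,}
module use ${workspace}/modulefiles
module load build_${platform,,}_${SRW_COMPILER}   # keeps the compiler runtime libraries available for task executables
module load wflow_${platform,,}
conda activate regional_workflow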

Collaborator

If you are interested in seeing what I did with the new script, please see:
/work/noaa/epic-ps/mlueken/ufs-srweather-app/ush/load_modules_ftest.sh

Collaborator Author

Confirmed that including module load build_${platform}_${compiler} resolves the issue with run_orog. A rework commit has been added to this PR.

@MichaelLueken

@BruceKropp-Raytheon Thanks for updating the PR! I'm now encountering a new error while attempting to manually run these tests:

+ ./config_utils.py -c /work/noaa/epic-ps/mlueken/ufs-srweather-app/ush/config.yaml -v /work/noaa/epic-ps/mlueken/ufs-srweather-app/ush/config_defaults.yaml
Traceback (most recent call last):
  File "/work/noaa/epic-ps/mlueken/ufs-srweather-app/ush/./config_utils.py", line 14, in <module>
    cfg_main()
  File "/work/noaa/epic-ps/mlueken/ufs-srweather-app/ush/python_utils/config_parser.py", line 591, in cfg_main
    r = check_structure_dict(cfg, cfg_t)
  File "/work/noaa/epic-ps/mlueken/ufs-srweather-app/ush/python_utils/config_parser.py", line 488, in check_structure_dict
    for k, v in dict_o.items():
AttributeError: 'NoneType' object has no attribute 'items'

Have you seen this error during your testing?

@BruceKropp-Raytheon

@MichaelLueken
I do not see the same error, but a different one:
INVALID ENTRY: metatask_run_ensemble={'task_run_fcst_mem#mem#': {'walltime': '02:00:00'}}

And I notice a new change at the end of ush/config.community.yaml that might be related:

rocoto:
  tasks:
    metatask_run_ensemble:
      task_run_fcst_mem#mem#:
        walltime: 02:00:00

@MichaelLueken

@BruceKropp-Raytheon On Friday, PR #676 was merged, which overhauled the parm/FV3LAM_wflow.xml file. The issue you are encountering could be related to the overhaul work (the rocoto: tasks were part of the overhaul).

@MichaelLueken

@BruceKropp-Raytheon If you replace line 86:

./config_utils.py -c $(pwd)/config.yaml -v $(pwd)/config_defaults.yaml

with:

./config_utils.py -c ./config.community.yaml -v ./config_defaults.yaml -k "(\!rocoto\b)"

then the consistency check should successfully pass on Orion:

+ ./config_utils.py -c ./config.community.yaml -v ./config_defaults.yaml -k '(\!rocoto\b)'
SUCCESS

This has also successfully been tested on Gaea, Hera (Intel and GNU), Jet, and Cheyenne (Intel and GNU).

Issues encountered on Hera:

I'm seeing the following failure in the make_grid task:

/scratch1/NCEPDEV/stmp2/Michael.Lueken/ufs-srweather-app/scripts/exregional_make_grid.sh: line 63: ulimit: stack size: cannot modify limit: Operation not permitted

When setting FORGIVE_CONDA=false, conda activate regional_workflow fails with:

/scratch1/NCEPDEV/nems/role.epic/miniconda3/4.12.0/envs/regional_workflow/etc/conda/activate.d/proj4-activate.sh: line 6: PROJ_LIB: unbound variable

Issue encountered on Jet:

I'm seeing the following failure in the get_ics task:

ERROR: You requested the hpss data store, but the HPSS module isn't loaded. This data store is only available on NOAA compute platforms.

It looks like the script will need to be updated to ensure that the following is being set in the config.yaml file:

USE_USER_STAGED_EXTRN_FILES: true

Further testing with this setting shows that the workflow is still attempting to pull data from HPSS rather than using the staged data. I'm not entirely sure why this is the case.
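For reference, a hedged sketch of how the script might force the staged data (the section and variable names are my assumptions about the config layout, and the path is a placeholder):

# Sketch: append staged-data settings for the ICs task to the test config
cat >> config.yaml <<'EOF'
task_get_extrn_ics:
  USE_USER_STAGED_EXTRN_FILES: true
  EXTRN_MDL_SOURCE_BASEDIR_ICS: /path/to/staged/input_model_data   # placeholder path
EOF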

Issue encountered on Cheyenne:

When setting FORGIVE_CONDA=false, conda activate regional_workflow fails with:

/glade/work/epicufsrt/contrib/miniconda3/4.12.0/envs/regional_workflow/etc/conda/activate.d/proj4-activate.sh: line 6: PROJ_LIB: unbound variable

@BruceKropp-Raytheon

@MichaelLueken
Hi Mike.
Using either of these on Orion:
./config_utils.py -c $(pwd)/config.community.yaml -v $(pwd)/config_defaults.yaml -k "(!rocoto\b)"
./config_utils.py -c ./config.community.yaml -v ./config_defaults.yaml -k "(!rocoto\b)"

The response is:
INVALID ENTRY: metatask_run_ensemble={'task_run_fcst_mem#mem#': {'walltime': '02:00:00'}}
FAILURE

The documentation suggests deriving ./config.yaml from a copy of ./config.community.yaml, then verifying any changes there using this:
./config_utils.py -c $(pwd)/config.yaml -v $(pwd)/config_defaults.yaml -k "(!rocoto\b)"
or this:
./config_utils.py -c ./config.yaml -v ./config_defaults.yaml -k "(!rocoto\b)"

Either way gives the same error as above.


@MichaelLueken

@BruceKropp-Raytheon

Hi Bruce,

Very interesting. When I use:

./config_utils.py -c $(pwd)/config.yaml -v $(pwd)/config_defaults.yaml -k "(!rocoto\b)"

I end up with the following:

+ ./config_utils.py -c /work/noaa/epic-ps/mlueken/ufs-srweather-app/ush/config.yaml -v /work/noaa/epic-ps/mlueken/ufs-srweather-app/ush/config_defaults.yaml -k '(!rocoto\b)'
SUCCESS

In PR #701, @danielabdi-noaa has made changes that should allow the above notation to get past the rocoto: issue. We might need to make PR #701 a dependency of this work to ensure that the consistency check passes successfully on Orion.

@BruceKropp-Raytheon

In addition, when following the instructions to enable graphics plots during the workflow (from https://ufs-srweather-app.readthedocs.io/en/develop/RunSRW.html#task-configuration):

workflow_switches:
  RUN_TASK_PLOT_ALLVARS: true

checking this using config_utils.py reports this error:

INVALID ENTRY: workflow_switches={'RUN_TASK_PLOT_ALLVARS': True}

@MichaelLueken left a comment

@BruceKropp-Raytheon Thank you for working with me on this PR! Approving this work now.

# Set parameters that the task scripts require ...
export JOBSdir=${workspace}/jobs
export USHdir=${workspace}/ush
export PDY=20190615
Collaborator

Is PDY meant to stay fixed?

Collaborator Author

PDY here is what is used for the default sample. If we want to allow a different sample, then we need a few of these to be variables (see the sketch below).
Note: some of these definitions (JOBSdir, USHdir, ...) were missing from the docs and were discovered by trial and error.
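A minimal sketch of making the sample date overridable, assuming a hypothetical SRW_FTEST_PDY override variable:

# Let callers pick a different sample date; otherwise fall back to the default sample
export PDY=${SRW_FTEST_PDY:-20190615}   # SRW_FTEST_PDY is a hypothetical override, not an existing variable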

# SRW_COMPILER=<intel|gnu>
#
# Optional:
[[ -n ${ACCOUNT} ]] || ACCOUNT="no_account"
@ulmononian Apr 26, 2023

Will the SRW, as it stands, functionally accommodate $ACCOUNT being set to no_account?

Collaborator Author

Yes, it allows the first 4 tasks to run without a valid ACCOUNT; the remaining tasks require an ACCOUNT.
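A hedged sketch of how the script could make that behavior explicit (the actual handling may differ):

# Sketch: only the first few (non-batch) tasks run when no valid account is available
if [[ "${ACCOUNT}" == "no_account" ]]; then
  echo "NOTE: no valid ACCOUNT set; skipping tasks that submit batch jobs."
fi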

@ulmononian left a comment

this looks solid and useful to me!

@MichaelLueken MichaelLueken merged commit e9503cd into ufs-community:develop Apr 26, 2023
Labels
enhancement New feature or request