[develop] Integrate 'online-cmaq' into 'develop' branch: Part 3, workflow set-up and run scripts #613

Merged: 140 commits merged into develop from online-cmaq on Feb 15, 2023

Conversation

Collaborator

@chan-hoo chan-hoo commented Feb 11, 2023

DESCRIPTION OF CHANGES:

Currently, Online-CMAQ for air quality modeling is managed separately in the 'online-cmaq' branch of the authoritative UFS SRW App repository. The main target of the current 'online-cmaq' branch is the implementation of AQM ver.7 and its delivery to NCO in April. Therefore, it has been developed and tested on WCOSS2 and Hera. Online-CMAQ checks out the same UFS Weather Model as the 'develop' branch. However, since it couples an additional component, 'AQM', its compile process differs from that of the 'develop' branch.

This 'online-cmaq' branch is being integrated into the 'develop' branch through PR #536, PR #549, and this PR. For the convenience of the reviewers, we split the 'online-cmaq' branch into three parts as follows:

Part 1: j-job and ex- scripts (PR #536)
Part 2: build scripts and module files (PR #549)
Part 3: workflow set-up, generation, run scripts (this PR)

git clone -b online-cmaq https://github.com/ufs-community/ufs-srweather-app
cd ufs-srweather-app
./manage_externals/checkout_externals
./devbuild.sh -p=hera -a=ATMAQ    # use -p=wcoss2 on WCOSS2
cd ush
cp config.aqm.community.yaml config.yaml
vim config.yaml    # update MACHINE and ACCOUNT
# load the Python environment for the workflow, then:
python3 generate_FV3LAM_wflow.py
  • Note that the last four tasks (post_stat_o3, post_stat_pm25, bias_correction_o3, and bias_correction_pm25) only work on WCOSS2.
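The platform restriction in the note above can be expressed as a small filter. This is only a sketch: the four WCOSS2-only task names come from the note, while the function name `tasks_expected_to_run` and the example task `run_fcst` are illustrative assumptions and not part of this PR.

```python
# Task names taken from the note above: per the PR description,
# these four tasks only work on WCOSS2.
WCOSS2_ONLY_TASKS = frozenset({
    "post_stat_o3",
    "post_stat_pm25",
    "bias_correction_o3",
    "bias_correction_pm25",
})


def tasks_expected_to_run(machine, tasks):
    """Return the tasks in `tasks` that are expected to work on `machine`.

    Hypothetical helper: on WCOSS2 every task is kept; elsewhere the
    WCOSS2-only tasks are dropped while preserving the input order.
    """
    if machine.strip().lower() == "wcoss2":
        return list(tasks)
    return [task for task in tasks if task not in WCOSS2_ONLY_TASKS]


# "run_fcst" is an illustrative task name, not taken from this PR.
example_tasks = ["run_fcst", "post_stat_o3", "bias_correction_pm25"]
print(tasks_expected_to_run("hera", example_tasks))
print(tasks_expected_to_run("WCOSS2", example_tasks))
```

A check like this could help explain why the last tasks in the sample workflow do not complete on Hera, which is expected behavior according to the note.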

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

TESTS CONDUCTED:

  • WE2E tests:
    grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16
    grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_RAP_suite_HRRR
    MET_verification
    community_ensemble_2mems_stoch

  • hera.intel

  • orion.intel

  • cheyenne.intel

  • cheyenne.gnu

  • gaea.intel

  • jet.intel

  • wcoss2.intel

  • NOAA Cloud (indicate which platform)

  • Jenkins

  • fundamental test suite

  • comprehensive tests (specify which if a subset was used)

ISSUE:

Fixes issue mentioned in #534.

CHECKLIST

  • My code follows the style guidelines in the Contributor's Guide
  • I have performed a self-review of my own code using the Code Reviewer's Guide
  • I have commented my code, particularly in hard-to-understand areas
  • My changes need updates to the documentation. I have made corresponding changes to the documentation
  • My changes do not require updates to the documentation (explain).
  • My changes generate no new warnings
  • New and existing tests pass with my changes
  • Any dependent changes have been merged and published

chan-hoo and others added 30 commits October 26, 2022 12:20
@danielabdi-noaa
Collaborator

@chan-hoo Thank you very much for addressing my suggestions, and in such a short time! I would like to go over the rest of the code today and re-run the test case once more, and will give my approval then.

Collaborator

@danielabdi-noaa danielabdi-noaa left a comment

@chan-hoo I re-ran the test case after renaming HOMEdir, SCRIPTSdir, etc. using rename_model.sh, and it ran successfully! There is a minor glitch in the rename script regarding the "test" directory, which is no longer there. Other than that, I have left a couple of minor comments, but this looks good to me, so I am approving! Thanks again.

ush/load_modules_run_task.sh (outdated, resolved)
ush/machine/wcoss2.yaml (resolved)
ush/job_preamble.sh (resolved)
@danielabdi-noaa
Collaborator

@chan-hoo The last task, pre_post_stat, for aqm_v16_test failed with the following error on Hera:

 ncks: error while loading shared libraries: libiomp5.so: cannot open shared object file: No such file or directory

That was not the case for the aqm_community_aqmna13 run I did earlier. The pre_post_stat.local modulefile is loaded in both cases.
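One way to narrow down an error like the one above is to list which shared libraries the dynamic loader cannot resolve for the failing executable. The sketch below only parses text in the format printed by the Linux `ldd` tool; the helper name `missing_shared_libs` and the sample paths and addresses are illustrative assumptions, not output captured from this run.

```python
def missing_shared_libs(ldd_output):
    """Return the library names that `ldd` reports as 'not found'.

    Hypothetical helper: expects text in the usual `ldd` format, where an
    unresolved dependency is printed as "<name> => not found".
    """
    missing = []
    for line in ldd_output.splitlines():
        if "=> not found" in line:
            missing.append(line.split("=>")[0].strip())
    return missing


# Illustrative sample only; the paths and addresses are made up.
sample_ldd_output = (
    "\tlibm.so.6 => /lib64/libm.so.6 (0x00007f0000000000)\n"
    "\tlibiomp5.so => not found\n"
    "\tlibc.so.6 => /lib64/libc.so.6 (0x00007f0000200000)\n"
)
print(missing_shared_libs(sample_ldd_output))
```

On a real system the input would come from running `ldd` on the `ncks` executable inside the same module environment that the task uses.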

@chan-hoo
Collaborator Author

> @chan-hoo The last task, pre_post_stat, for aqm_v16_test failed with the following error on Hera:
>
>  ncks: error while loading shared libraries: libiomp5.so: cannot open shared object file: No such file or directory
>
> That was not the case for the aqm_community_aqmna13 run I did earlier. The pre_post_stat.local modulefile is loaded in both cases.

@danielabdi-noaa, can you try it again? I didn't get this error on my end. The error occurred because the 'nco' module was not loaded properly. I guess it was a temporary system issue (hopefully :) )

@danielabdi-noaa
Collaborator

@chan-hoo Do I need to rebuild the app with your new commits? I have tried a module purge and reloading the modules, but it didn't make a difference. My experiment directory is here: /scratch2/BMC/gsd-hpcs/Daniel.Abdi/expt_dirs/aqm_v16_test

@chan-hoo
Collaborator Author

@danielabdi-noaa, I don't think so. I compared my run to yours (/scratch2/NCEPDEV/fv3-cam/Chan-hoo.Jeon/online-cmaq_test/expt_dirs/aqm_v16_test/log/pre_post_stat_2023011700.log):

Currently Loaded Modules:
  1) nco/4.9.3   2) pre_post_stat.local

In your log file:

Currently Loaded Modules:
  1) hdf5/1.10.6   2) netcdf/4.7.4   3) nco/4.9.3   4) pre_post_stat.local

I can see hdf5 and netcdf in your log file. I think they caused this issue. I am not sure why they were loaded. I'll run it again.
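The comparison of the two "Currently Loaded Modules" listings above can be sketched in a few lines. The function names `loaded_modules` and `extra_modules` are hypothetical, and the parsing assumes the log contains the listing in the same "N) name" layout as the excerpts quoted in this comment.

```python
import re

# Matches entries such as "1) nco/4.9.3" in a "Currently Loaded Modules:" listing.
_ENTRY = re.compile(r"\d+\)\s+(\S+)")


def loaded_modules(log_text):
    """Return module names from the last 'Currently Loaded Modules:' block."""
    modules = []
    in_block = False
    for line in log_text.splitlines():
        if line.strip() == "Currently Loaded Modules:":
            # Start (or restart) collecting; only the last block is kept.
            modules, in_block = [], True
            continue
        if not in_block:
            continue
        found = _ENTRY.findall(line)
        if found:
            modules.extend(found)
        elif line.strip() or modules:
            # A non-entry line, or a blank line after entries, ends the block.
            in_block = False
    return modules


def extra_modules(reference_log, other_log):
    """Return modules loaded in other_log but not in reference_log, sorted."""
    return sorted(set(loaded_modules(other_log)) - set(loaded_modules(reference_log)))


# Listings copied from the two log excerpts quoted above.
working_log = "Currently Loaded Modules:\n  1) nco/4.9.3   2) pre_post_stat.local\n"
failing_log = (
    "Currently Loaded Modules:\n"
    "  1) hdf5/1.10.6   2) netcdf/4.7.4   3) nco/4.9.3   4) pre_post_stat.local\n"
)
print(extra_modules(working_log, failing_log))
```

For the two excerpts above, the difference is the hdf5 and netcdf modules, which matches the observation in this comment.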

@danielabdi-noaa
Collaborator

@chan-hoo Yes, I noticed that too. Compared to the previous run I made that finished successfully,

/scratch2/BMC/gsd-hpcs/Daniel.Abdi/expt_dirs/aqm_community_aqmna13/log/pre_post_stat_2023011700.log

the new one had the extra libs loaded.

@chan-hoo
Collaborator Author

@danielabdi-noaa, I don't know what happened. I cloned the latest one and ran the sample script again. All the tasks were completed successfully. Can you try with the latest one again?

@danielabdi-noaa
Collaborator

@chan-hoo I re-ran the test case and it succeeded this time!

Collaborator

@MichaelLueken MichaelLueken left a comment

@chan-hoo I have two minor requests regarding commented-out lines in the new ush/config.aqm.*.yaml files. If you would like to keep these commented-out lines, that is fine, but if they aren't going to be used in the future, it would probably be best to remove them.

ush/config.aqm.community.yaml (outdated, resolved)
ush/config.aqm.community.yaml (outdated, resolved)
ush/config.aqm.nco.realtime.yaml (outdated, resolved)
ush/config.aqm.nco.realtime.yaml (outdated, resolved)
@MichaelLueken
Collaborator

@chan-hoo Thank you very much for addressing my concerns! In the bug fixes merged this morning, there is an additional update to the .cicd/Jenkinsfile that will need to be merged into online-cmaq before I can submit the Jenkins tests. At your earliest convenience, please merge these two updates into online-cmaq. Thanks!

@chan-hoo
Collaborator Author

@MichaelLueken, merged.

Collaborator

@MichaelLueken MichaelLueken left a comment

@chan-hoo Thanks for addressing my concerns and merging in the latest updates! Approving now.

@MichaelLueken added the ci-hera-intel-WE (kicks off automated workflow test on hera with intel) and run_we2e_coverage_tests (run the coverage set of SRW end-to-end tests) labels on Feb 14, 2023
@venitahagerty removed the ci-hera-intel-WE (kicks off automated workflow test on hera with intel) label on Feb 14, 2023
@venitahagerty
Collaborator

venitahagerty commented Feb 14, 2023

Machine: hera
Compiler: intel
Job: WE
Repo location: /scratch1/BMC/zrtrr/rrfs_ci/autoci/pr/1237737580/20230214155015/ufs-srweather-app
Build was Successful
Rocoto jobs started
Long term tracking will be done on 10 experiments
If test failed, please make changes and add the following label back:
ci-hera-intel-WE
Experiment Succeeded on hera: community_ensemble_2mems_stoch
Experiment Succeeded on hera: grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta
Experiment Succeeded on hera: pregen_grid_orog_sfc_climo
Experiment Succeeded on hera: grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_v15p2
Experiment Succeeded on hera: grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_RAP_suite_HRRR
Experiment Succeeded on hera: grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_2017_gfdlmp_regional_plot
Experiment Succeeded on hera: grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16
Experiment Succeeded on hera: grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_HRRR
Experiment Succeeded on hera: grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2
Experiment Succeeded on hera: MET_ensemble_verification
All experiments completed
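A report like the one above can be summarized programmatically. The sketch below is an assumption-laden illustration: the helper name `summarize_report` is hypothetical, and only the "Experiment Succeeded on <machine>: <name>" line format shown in this comment is known; any other status word is simply recorded as it appears.

```python
import re

# Line format taken from the report above, e.g.
# "Experiment Succeeded on hera: MET_ensemble_verification".
_RESULT = re.compile(r"^Experiment (\w+) on (\w+): (\S+)$")


def summarize_report(report_text):
    """Map each experiment name to the status word reported for it."""
    results = {}
    for line in report_text.splitlines():
        match = _RESULT.match(line.strip())
        if match:
            status, _machine, name = match.groups()
            results[name] = status
    return results


# Three lines copied from the report above, plus its closing line.
sample_report = (
    "Experiment Succeeded on hera: community_ensemble_2mems_stoch\n"
    "Experiment Succeeded on hera: pregen_grid_orog_sfc_climo\n"
    "Experiment Succeeded on hera: MET_ensemble_verification\n"
    "All experiments completed\n"
)
summary = summarize_report(sample_report)
not_succeeded = [name for name, status in summary.items() if status != "Succeeded"]
print(len(summary), "experiments parsed;", len(not_succeeded), "not reported as Succeeded")
```

Comparing the number of parsed experiments against the count announced in the "Long term tracking" line would be a simple consistency check on the full report.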

@chan-hoo
Collaborator Author

@MichaelLueken, Jenkins test failed. Can you take a look at the error messages?

@MichaelLueken
Collaborator

@chan-hoo The failed test was on Gaea: the Jenkins runner was initialized before the system was ready to run jobs, which caused the tests to fail. I'm working to relaunch the Gaea tests now.

@MichaelLueken
Collaborator

@chan-hoo The test on Gaea passed. Moving forward with merging this work now.

@MichaelLueken MichaelLueken merged commit 2427e93 into develop Feb 15, 2023
@chan-hoo chan-hoo deleted the online-cmaq branch February 16, 2023 00:33
Labels
run_we2e_coverage_tests Run the coverage set of SRW end-to-end tests

6 participants