Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix time offset issue on ICS with GFS nemsio and netcdf files and add new archive file name on HPSS #457

Merged
merged 12 commits into from
Nov 9, 2022

Conversation

chan-hoo
Copy link
Collaborator

@chan-hoo chan-hoo commented Nov 5, 2022

DESCRIPTION OF CHANGES:

  1. The surface climatology file names (gfs.t{hh}z.sfcf{fcst_hr:03d}.[nemsio/netcdf]) are added to file_names of nemsio/fcst and netcdf/fcst in parm/data_locations.yml.
  • When TIME_OFFSET is off, the make_ics task reads gfs.t12z.atmanl.nemsio(netcdf) and gfs.t12z.sfcanl.nemsio(netcdf)
  • When TIME_OFFSET is on (for example: 6), the make_ics task reads gfs.t06z.atmf006.nemsio(netcdf) and gfs.t06z.sfcf006.nemsio(netcdf). The flag anl is switched to fcst.
  • In the current list of 'file_names', gfs.t{hh}z.sfcf{fcst_hr:03d} is missing in fcst of FV3GFS.
  • However, this gfs.t{hh}z.sfcf{fcst_hr:03d} should be removed for make_lbcs to avoid any unnecessary work with the files. Therefore, the flag ics_or_lbcs is added to ush/retrieve_data.py.
  1. New archive file names of GFS data are added to parm/data_locations.yml.
  • Old file name (with _prod_): com_gfs_prod_gfs.{yyyymmdd}_{hh}.gfs_pgrb2.tar
  • New file anme (with _v16.2_): com_gfs_v16.2_gfs.{yyyymmdd}_{hh}.gfs_pgrb2.tar

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

TESTS CONDUCTED:

  • New WE2E test:
    nco_grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_timeoffset_suite_GFS_v16

  • WE2E tests:
    community_ensemble_2mems_stoch
    grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_2017_gfdlmp_regional
    grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2
    grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_RAP_suite_HRRR
    grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_RRFS_v1beta
    nco_grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15_thompson_mynn_lam3km

  • hera.intel

  • orion.intel

  • cheyenne.intel

  • cheyenne.gnu

  • gaea.intel

  • jet.intel

  • wcoss2.intel

  • NOAA Cloud (indicate which platform)

  • Jenkins

  • fundamental test suite

  • comprehensive tests (specify which if a subset was used)

ISSUE:

Fixes issue mentioned in #456

CHECKLIST

  • My code follows the style guidelines in the Contributor's Guide
  • I have performed a self-review of my own code using the Code Reviewer's Guide
  • I have commented my code, particularly in hard-to-understand areas
  • My changes need updates to the documentation. I have made corresponding changes to the documentation
  • My changes do not require updates to the documentation (explain).
  • My changes generate no new warnings
  • New and existing tests pass with my changes
  • Any dependent changes have been merged and published

CONTRIBUTORS

@chan-hoo
Copy link
Collaborator Author

chan-hoo commented Nov 5, 2022

@christinaholtNOAA @danielabdi-noaa, if you have any better idea, please let me know.

Copy link
Collaborator

@danielabdi-noaa danielabdi-noaa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me.

if cla.external_model == "FV3GFS" and cla.ics_or_lbcs == "LBCS":
del file_templates['nemsio']['fcst'][1]
del file_templates['netcdf']['fcst'][1]

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you may need to add a similar entry to the "nemsio" format for this to work. Changing the format to nemsio for the new test case you added I get this error:

Traceback (most recent call last):
  File "/scratch2/BMC/gsd-hpcs/Daniel.Abdi/ufs-srweather-app/ush/retrieve_data.py", line 1012, in <module>
    main(sys.argv[1:])
  File "/scratch2/BMC/gsd-hpcs/Daniel.Abdi/ufs-srweather-app/ush/retrieve_data.py", line 814, in main 
    file_templates = get_file_templates(
  File "/scratch2/BMC/gsd-hpcs/Daniel.Abdi/ufs-srweather-app/ush/retrieve_data.py", line 296, in get_file_templates
    del file_templates['nemsio']['fcst'][1]
IndexError: list assignment index out of range

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is happening because the gfs_file_names anchor is shared for all sources, so when it tries hpss and doesn't succeed it deletes the sfc entry. Then when it tries aws, the entry is already gone. I think you can solve this by not using anchors in the yaml file (bad) or by making a deep copy of the dictionary before deleting the entries.

Copy link
Collaborator Author

@chan-hoo chan-hoo Nov 5, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@danielabdi-noaa, nemsio is not available for the date (08/2022). netcdf and grib2 are only available. Do we need to add any other conditions for this?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@danielabdi-noaa, I've added if-statements to check the availability of netcdf and nemsio for the cycle date in jobs/JREGIONAL_GET_EXTRN_MDL_FILES. If nemsio (or netcdf) is not available, the get_extrn_ics/lbcs will fail with an error message.

Copy link
Collaborator

@danielabdi-noaa danielabdi-noaa Nov 6, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chan-hoo I still think we need a safeguard against multiple deletions of the sfc file -- best would be to find another way that does not delete entries in the data_locations dictionary. If hpss does not have the file for some reason (maybe it is down), and we want to try aws next, it will fail because the sfc entry has been deleted by HPSS. The invalid nemsio date run i did (although wrong) shows what could happen if we can't find the file on hpss but could be available on aws. I added this before the del if statement to make it try aws successfully when it could not find it in hpss

file_templates = deepcopy(file_templates)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@danielabdi-noaa, yes, I agree with you. Can you help me? I've sent an invitation to you.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chan-hoo No big deal, i've pushed the change i think will avoid multiple deletions of the sfc file. It is not an ideal solution so lets wait for @christinaholtNOAA .

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

Copy link
Collaborator

@MichaelLueken MichaelLueken left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chan-hoo These changes look good to me. I have built your branch and submitted the new nco_grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_timeoffset_suite_GFS_v16 WE2E test on Hera. The new error message when setting a file format that has past the maximum CDATE is a great addition! I will go ahead and launch the Jenkins tests for this work now. Once @christinaholtNOAA has had the opportunity to review this work, I will approve.

@MichaelLueken MichaelLueken added the run_we2e_coverage_tests Run the coverage set of SRW end-to-end tests label Nov 7, 2022
@chan-hoo
Copy link
Collaborator Author

chan-hoo commented Nov 7, 2022

@MichaelLueken , Thank you!

@panll
Copy link
Collaborator

panll commented Nov 7, 2022

Tests passed on Here. It looks good to me!

@danielabdi-noaa
Copy link
Collaborator

@chan-hoo The MET_verification test is going to fail on cheyenne, but is not your fault. I'm going to have to look at the log files when it finishes to figure out why, but would you be willing to remove that test case from fundamental.cheyenne.gnu with your PR?

@chan-hoo
Copy link
Collaborator Author

chan-hoo commented Nov 7, 2022

@danielabdi-noaa, updated.

Copy link
Collaborator

@danielabdi-noaa danielabdi-noaa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

Copy link
Collaborator

@MichaelLueken MichaelLueken left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chan-hoo With the exception of the MET_verification test on Cheyenne (which your changes wouldn't affect), all of the Jenkins tests have successfully passed.

I would like to note that the new nco_grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_timeoffset_suite_GFS_v16 test fails on Cheyenne. It appears as though the test data needs to be staged on that machine. Since the data isn't available, the test will fail. Before this test can be moved to either the fundamental or comprehensive test sets, the new ICs and LBCs will need to be brought over to Cheyenne.

@MichaelLueken
Copy link
Collaborator

Pending no additional feedback from @christinaholtNOAA, I will merge this PR before 5 pm EST.

@christinaholtNOAA
Copy link
Collaborator

Actually, I just now saw this. Give me just a little while to take a look?

@MichaelLueken
Copy link
Collaborator

Actually, I just now saw this. Give me just a little while to take a look?

@christinaholtNOAA No rush. I'll wait until you have approved the PR before merging.

@MichaelLueken
Copy link
Collaborator

@chan-hoo The MET_verification test is going to fail on cheyenne, but is not your fault. I'm going to have to look at the log files when it finishes to figure out why, but would you be willing to remove that test case from fundamental.cheyenne.gnu with your PR?

@danielabdi-noaa It looks like there is an issue with the metplus installation on Cheyenne:

/glade/p/ral/jntp/MET/MET_releases/10.1.1/bin/grid_stat: error while loading shared libraries: libiomp5.so: cannot open shared object file: No such file or directory

I'll run a test using @natalie-perlin's Cheyenne HPC-stack location and see if this error persists.

@danielabdi-noaa
Copy link
Collaborator

@MichaelLueken Thanks for looking into the issue. There are some more issues to address with EPIC modulefiles which I've summarized in this issue #458

Copy link
Collaborator

@christinaholtNOAA christinaholtNOAA left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chan-hoo I'm sorry for the late response. These emails didn't immediately grab my attention.

I left a couple of comments below with some concerns and suggestions.


# Remove sfc files from fcst in file_names of FV3GFS for LBCs
# sfc files needed in fcst when time_offset is not zero.
if cla.external_model == "FV3GFS" and cla.ics_or_lbcs == "LBCS":
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this just be if "lbcs" and "fcst"? That way we don't tightly couple the tool to specifics of the models we use as inputs?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@christinaholtNOAA, I am afraid that it will cause another unexpected errors with other models. For example, 'GDAS', 'GSMGFS', 'RAP', 'HRRR', and 'NAM' have only one file name for 'fcst'. In these cases, we will have the same index error. For 'GEFS', we should not remove the second array from the file names. We just want to remove the 'sfc' file from the list. 'FV3GFS' only has this issue. What do you think of this?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This logic is still pretty brittle. It assumes that there must be at least 2 and that the 2nd one is the surface file. I think that if we want to protect that assumption, we need a functional test.

I also think it's probably safer to do something like delete the entry if "sfc" is in the name of the file. That type of logic may get us the additional functionality we'd need to use offset GDAS netcdf/nemsio as ICS and expand the logic to "lbcs" and "fcst" instead of FV3GFS-specific logic. In general that says "if we're looking for lbcs from forecasts and a surface file is listed, don't try to get it".

As best I can tell, sfc files only show up for GDAS, FV3GFS, and GSMGFS.

I'd prefer to see something like this:

if cla.ics_or_lbcs == "LBCS":
    for format in ['netcdf', 'nemsio']:
        for i, tmpl in enumerate(file_templates.get('format', {}).get('fcst', [])):
            if "sfc" in tmpl:
                 del file_templates[format]['fcst'][i]

It reduces the assumptions about order and the length of the list provided. It assures that we're only removing surface files, and not atm files accidentally. It also helps us if we want to expand our experiments run with offset GDAS files as ICs -- we can just add the sfc file to the GDAS fcst entry.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@christinaholtNOAA, I tested your script, but sfc file was not removed from the list for LBCs: DEBUG: Looking for files like ['gfs.t{hh}z.atmf{fcst_hr:03d}.nemsio', 'gfs.t{hh}z.sfcf{fcst_hr:03d}.nemsio']. Any suggestion?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found a typo there: 'format' => format. I am testing it again.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies...I was just writing directly in the text box here and didn't do a full test.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@christinaholtNOAA , the tests were completed successfully!! The part has been replaced with your script. Thank you!

@@ -34,8 +34,8 @@ task_run_fcst:
FIXlut: /lfs/h2/emc/lam/noscrub/UFS_SRW_App/develop/fix/fix_lut
data:
GSMGFS: compath.py ${envir}/gsmgfs/${gsmgfs_ver}/gsmgfs.${PDY}
FV3GFS: compath.py ${envir}/gfs/${gfs_ver}/gfs.${PDY}
FV3GFS: compath.py ${envir}/gfs/${gfs_ver}/gfs.${PDY}/${cyc}/atmos
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not use ${hh} here instead of ${cyc}? This is a bit misleading because this ${cyc} is not necessarily aligned with the cyc set for the RRFS config.

Then, there's no need to rename cyc in the ex-script.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, I agree. it has been replaced with 'hh'.

@@ -89,7 +89,7 @@ yyyymmdd=${yyyymmddhh:0:8}
mm=${yyyymmddhh:4:2}
dd=${yyyymmddhh:6:2}
hh=${yyyymmddhh:8:2}

cyc=${hh}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is potentially a problem. It resets an NCO required variable. This is the GFS cyc that we're looking for, which is a namespace collision with the existing cyc for RRFS.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed.

@chan-hoo
Copy link
Collaborator Author

chan-hoo commented Nov 8, 2022

@christinaholtNOAA, no problem at all. I understand. We are receiving so many emails from SRW :) I've replaced 'cyc' with 'hh'. However, I'd like to hear from you about the if-statement condition as I mentioned above.

@MichaelLueken
Copy link
Collaborator

@chan-hoo I have gone ahead and relaunched the Jenkins test for Cheyenne with the GNU compiler. All tests successfully passed. Once @christinaholtNOAA has approved, I will go ahead and merge this PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
run_we2e_coverage_tests Run the coverage set of SRW end-to-end tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants