Esmf/pio issue on Gaea, causing failure of cpld_control_ciceC_p8 & cpld_control_c192_p8 #1683
Comments
I ran these two tests on the current develop branch (with #1633 merged in) with updated pio and esmf (pio/2.5.10, esmf/8.4.1), and the tests passed. The hpc-stack install directory is here |
Awesome! @natalie-perlin can we follow up on this? Sounds like we need to re-install. |
@natalie-perlin I mean we can make sure this issue is reflected in the next round of library updates. We clearly need new pio and esmf versions. Let's try to be on the same page about this issue. |
@jkbk2004 , @DusanJovic-NOAA -
Is the test failure in (1) caused by new code requirements that need higher versions, i.e., pio/2.5.10 and esmf/8.4.1? For (2): there is a new installation on Gaea that uses hdf5/1.14.0 + netcdf/4.9.1 + pio/2.5.10 + esmf/8.4.1 + mapl/2.35.2. Could you please run the test using this stack and let me know if it helps? (Note the new modules/versions to update in the modulefiles.) |
@natalie-perlin I can test the new installation on Gaea and let you know if it is successful. |
@natalie-perlin I'm having some issues testing with the new installation. New installation: /lustre/f2/dev/role.epic/contrib/hpc-stack/intel-2021.3.0_ncdf49 The compile finishes successfully; however, it fails right before run_test, so there's no specific error, just the below. I updated ufs_gaea.intel.lua in ufs-weather-model/modulefiles to include the new modulepath, and updated ufs_common.lua to include the new versions: hdf5/1.14.0 + netcdf/4.9.1 + pio/2.5.10 + esmf/8.4.1 + mapl/2.35.2. |
@zach1221, the error is in rt.sh (line 801): it cannot find the rt_*.log files. |
Are you using the standard hpc-stack installation location in the original issue? |
The issue has been identified and fixed. The default modulefile was pointing to a later installation
@natalie-perlin still receiving "the error is in rt.sh (line 801) it cannot find rt_*.log files." with these two cases on Gaea. Trying to troubleshoot and dig up additional info. |
@zach1221 - if you are testing only the build ("compile") but not the run, you may not have any rt_*.log files, which are created after the "run" phase. Adding a conditional check for whether such files are present might help avoid the error. Your log file /lustre/f2/pdata/ncep/Zachary.Shrader/ufs-weather-model/tests/log_gaea.intel/compile_001.log reports test completed: |
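A minimal sketch of such a conditional check, assuming the failing step reads every rt_*.log file from a single log directory. The function names `have_rt_logs` and `collect_rt_logs` are hypothetical and are not taken from rt.sh:

```sh
# Hypothetical helpers (not from rt.sh): guard any step that reads rt_*.log
# files so that a compile-only run, which produces none, does not fail.

# Succeeds if at least one rt_*.log file exists in the given directory.
have_rt_logs() {
  for hrl_file in "$1"/rt_*.log; do
    # An unmatched glob stays literal, so confirm the path really exists.
    [ -e "$hrl_file" ] && return 0
  done
  return 1
}

# Concatenates the per-test logs into one file, or skips with a notice.
collect_rt_logs() {
  if have_rt_logs "$1"; then
    cat "$1"/rt_*.log > "$2"
  else
    echo "No rt_*.log files in $1; skipping log collection." >&2
  fi
}
```

For example, `collect_rt_logs "$log_dir" "$summary_file"` would leave `$summary_file` untouched after a compile-only run instead of aborting. The output file should live outside the rt_*.log pattern so it is not read back in.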
@natalie-perlin I'm trying to run the case as well, after compiling, but it gets stuck after the compile completes, with the rt_*.log error. I'll investigate further and update when I've found the cause. |
Apologies for the delay, @natalie-perlin. Tests cpld_control_ciceC_p8 & cpld_control_c192_p8 worked for me on Gaea using pio/2.5.10 and esmf/8.4.1 from the installation @DusanJovic-NOAA mentioned above. Maybe this is the direction we should go in for updating esmf/pio on Gaea? I couldn't get the standard/current versions, pio/2.5.7 & esmf/8.3.0b09, to work. |
@zach1221 @DusanJovic-NOAA - does your stack build use netcdf/4.9.1 or netcdf/4.9.2? |
netcdf/4.7.4. The install directory is here
Hi, @natalie-perlin I attempted cpld_control_ciceC_p8 using your latest installation of hpc-stack on Gaea, which includes hdf5/1.14.0 + netcdf/4.9.2 + pio/2.5.10 + esmf/8.4.1 + mapl/2.35.2. However, it's failing in the test run. The compile is successful, though. It seems like an issue possibly related to the mapl version? |
I did test again with the combination of hdf5/1.10.6 + netcdf/4.7.4 + pio/2.5.10 + esmf/8.4.1 + mapl/2.22.0, and the cases cpld_control_ciceC_p8 and cpld_control_c192_p8 pass fine. Could we use this configuration on Gaea currently, or do hdf5, netcdf, and mapl also need to be updated along with esmf/pio? |
The gocart failure looks similar to #1629 |
Thanks, @DusanJovic-NOAA. It does look similar, and based on what I'm reading from issue 1621, there may be an outstanding problem. @natalie-perlin This could be a GOCART-related issue that currently affects some tests run with the library stack based on (hdf5/1.14.0 + netcdf/4.9.2 + pio/2.5.10 + esmf/8.4.1 + mapl/2.35.2). This may also give you cause to go with the alternative (hdf5/1.10.6 + netcdf/4.7.4 + pio/2.5.10 + esmf/8.4.1 + mapl/2.22.0) |
@zach1221 @DusanJovic-NOAA @jkbk2004 - |
@zach1221 - Ready for Gaea in the stack: Verifying loading the modules:
|
Thanks @natalie-perlin I will test this today! |
@natalie-perlin I've added modules/3.2.11.4 to the modulefile for Gaea, but it seems Lmod is unable to locate it. Here's my modulefile setup. |
@zach1221 - |
Hi, @natalie-perlin I've removed the "modules/3.2.11.4" module from the Gaea modulefile. My next attempt failed because it was unable to load esmf/8.4.1. Steps to reproduce:
|
Please verify that the Lmod initialization is run on Gaea before the modules are loaded. You could test that the modules are loaded properly, after your steps 1-3, as follows:
You could also build the code needed for the cpld_control_ciceC_p8 test, which builds with no issues
It gives some warnings at the end of the build, but it builds the executable:
|
@zach1221 for testing purposes on Gaea, instead of sourcing module-setup.sh, you could just source the Lmod initialization script and then load ufs_gaea.intel:
|
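As a rough illustration of the kind of check discussed above (not the commands from the original comments), the sketch below confirms that a `module` command is defined in the current shell, which is what Lmod initialization provides, and that the expected versions appear in the loaded-module list. The helper names `lmod_ready` and `check_loaded` are hypothetical. The check is a plain substring match, so it is only a coarse sanity test.

```sh
# Hypothetical helpers for a quick sanity check of the module environment.

# Succeeds only if a `module` command or shell function is defined here.
lmod_ready() {
  command -v module >/dev/null 2>&1
}

# check_loaded LIST_TEXT NAME/VERSION...
# Prints "missing: NAME/VERSION" for each expected module that does not
# appear in LIST_TEXT, and returns 1 if any is missing.
check_loaded() {
  cl_list="$1"
  shift
  cl_status=0
  for cl_mod in "$@"; do
    case "$cl_list" in
      *"$cl_mod"*) ;;
      *) echo "missing: $cl_mod"; cl_status=1 ;;
    esac
  done
  return "$cl_status"
}

# Lmod prints the module list on stderr, hence the 2>&1 redirection.
if lmod_ready; then
  check_loaded "$(module list 2>&1)" pio/2.5.10 esmf/8.4.1
fi
```

If `lmod_ready` fails, the Lmod initialization step was most likely skipped in the current shell, and any subsequent `module load` would fail for that reason rather than because of the stack itself.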
I started an rt.sh job as well; the compilation finished successfully, see |
@natalie-perlin I've got it working now. I'll update you here as soon as the tests pass. |
@natalie-perlin the latest stack installation at /lustre/f2/dev/role.epic/contrib/hpc-stack/intel-2021.3.0_noarch/ seems to have resolved the issue regarding cpld_control_ciceC_p8 & cpld_control_c192_p8 on Gaea. fyi @jkbk2004 |
@natalie-perlin thanks for the information, let me try again when GAEA is back |
@natalie-perlin is intel/2022.2.1 available on both C4 and C5? |
@jkbk2004 @natalie-perlin cpld_control_ciceC_p8 & cpld_control_c192_p8 fail on c4 due to the inability to load intel/2021.3.0. I'm testing on c3 now. |
It sounds like intel-2022.0.2/classic/oneapi will be the most practical option. |
C4 (gaea13 check): intel-classic/2022.2.1 Yes, same name on both C4 and C5 partitions |
Both cpld_control_ciceC_p8 and cpld_control_c192_p8 run successfully now on Gaea. Logs: /lustre/f2/pdata/ncep/Zachary.Shrader/ufs-weather-model/tests/logs/RegressionTests_gaea.log Gaea has been re-enabled for these two tests in ufs-wm #1912 . |
Description
When attempting to run the regression test suite on Gaea, cpld_control_ciceC_p8 & cpld_control_c192_p8 fail due to an esmf/pio-related error.
These cases, along with cpld_restart_c192_p8, have been disabled for Gaea until the issue can be resolved.
To Reproduce:
What compilers/machines are you seeing this with?
Intel
Give explicit steps to reproduce the behavior.
Additional context
Output
Screenshots
![Gaea_err](https://user-images.githubusercontent.com/99902696/228050310-18d6776a-56e0-49aa-9809-7a4c7b0f8209.PNG)
The error complains about the esmf/pio stack libraries.