Esmf/pio issue on Gaea, causing failure of cpld_control_ciceC_p8 & cpld_control_c192_p8 #1683

Closed
zach1221 opened this issue Mar 27, 2023 · 40 comments

@zach1221
Collaborator

Description

When running the regression test suite on Gaea, cpld_control_ciceC_p8 and cpld_control_c192_p8 fail due to an esmf/pio-related error.

These cases, along with cpld_restart_c192_p8, have been disabled for Gaea until the issue is resolved.

To Reproduce:

What compilers/machines are you seeing this with?
Intel
Give explicit steps to reproduce the behavior.

  1. Log in to Gaea.
  2. Clone the ufs-weather-model repo.
  3. cd into ufs-weather-model/tests.
  4. Run ./rt.sh -n once for each test: cpld_control_ciceC_p8 and cpld_control_c192_p8.
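
A minimal sketch of step 4, assuming `rt.sh -n` takes one test name per invocation, as the command above suggests. The `run_single_tests` helper and the `DRY_RUN` variable are hypothetical names used only for illustration. By default the script just prints the commands, so it can be tried outside of Gaea.

```shell
#!/usr/bin/env bash
# Hypothetical helper: one rt.sh invocation per test name.
# DRY_RUN (assumed name) defaults to 1, which only prints the commands.
run_single_tests() {
  local t
  for t in "$@"; do
    if [ "${DRY_RUN:-1}" = "1" ]; then
      echo "./rt.sh -n ${t}"
    else
      # Stop at the first failing invocation.
      ./rt.sh -n "${t}" || return 1
    fi
  done
}

# Intended to be run from ufs-weather-model/tests.
run_single_tests cpld_control_ciceC_p8 cpld_control_c192_p8
```

Setting `DRY_RUN=0` would actually launch the tests; any additional options that rt.sh may need on a given machine are not shown here.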

Additional context

Output

Screenshots
Gaea_err

The error output complains about the esmf/pio stack libraries.


@zach1221 zach1221 added the bug Something isn't working label Mar 27, 2023
@zach1221 zach1221 self-assigned this Mar 27, 2023
@DusanJovic-NOAA
Collaborator

I ran these two tests on the current develop branch (with #1633 merged in) using updated pio and esmf (pio/2.5.10, esmf/8.4.1), and the tests passed. The hpc-stack install directory is here: /lustre/f2/dev/Dusan.Jovic/hpc-stack/opt_intel_esmf_841/modulefiles

@jkbk2004
Collaborator

Awesome! @natalie-perlin can we follow up on this? Sounds like we need to re-install.

@jkbk2004
Collaborator

@natalie-perlin I mean we can make sure this issue is reflected in the next round of library updates. We clearly need new pio and esmf versions. Let's try to be on the same page about this issue.

@natalie-perlin
Collaborator

@jkbk2004 , @DusanJovic-NOAA -
Just to be clear on the issues, there are two things:

  1. Failing of the test using the current hpc-stack, based on pio/2.5.7 and esmf/8.3.0b09 (?)
  2. Need to update these libraries to pio/2.5.10 and esmf/8.4.1

Is the test failure in (1) caused by new code requirements that need higher versions, i.e., pio/2.5.10 and esmf/8.4.1?

For (2), there is a new installation on Gaea that uses hdf5/1.14.0 + netcdf/4.9.1 + pio/2.5.10 + esmf/8.4.1 + mapl/2.35.2:
/lustre/f2/dev/role.epic/contrib/hpc-stack/intel-2021.3.0_ncdf49

Could you please run the test using this stack and let me know if this helps? Note the new modules/versions to update in the modulefiles.
The S2SWA code compiles successfully with the new stack.

@zach1221
Collaborator Author

@natalie-perlin I can test the new installation on Gaea and let you know if it is successful.

@zach1221
Collaborator Author

@natalie-perlin I'm having some issues testing with the new installation.

New installation: /lustre/f2/dev/role.epic/contrib/hpc-stack/intel-2021.3.0_ncdf49
Note I'm testing with ufs-wm RT cases cpld_control_ciceC_p8 & cpld_control_c192_p8
My working directory logs: /lustre/f2/pdata/ncep/Zachary.Shrader/ufs-weather-model/tests/log_gaea.intel
Experiment directory: /lustre/f2/scratch/Zachary.Shrader/FV3_RT/rt_12942

The compile finishes successfully; however, it fails right before run_test, so there is no specific error, just the output below.
(screenshot: err3)

I updated ufs_gaea.intel.lua in ufs-weather-model/modulefiles to include the new modulepath and updated ufs_common.lua to include the new versions of hdf5/1.14.0 + netcdf/4.9.1 +pio/2.5.10 + esmf/8.4.1 + mapl/2.35.2.

@natalie-perlin
Collaborator

@zach1221, the error is in rt.sh (line 801): it cannot find rt_*.log files.

@natalie-perlin
Collaborator

Are you using the standard hpc-stack installation location in the original issue?
It looks like there is indeed an issue with esmf, which points at a later esmf/8.5.0b17 installation instead of the standard esmf/8.3.0b09.
Which esmf version are you loading in your test?

@natalie-perlin
Collaborator

The issue has been identified and fixed. The default modulefile was pointing to a later installation (default -> 8.5.0b17.lua). It now points to the standard version (default -> 8.3.0b09.lua).
This hopefully resolves the original issue with the stack in
/lustre/f2/dev/role.epic/contrib/hpc-stack/intel-2021.3.0_noarch
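
One way to check which version a `default` link resolves to, sketched under the assumption that the default is a symlink named `default` sitting next to the versioned `.lua` files (which is what the `default -> 8.3.0b09.lua` notation suggests). The function name `default_module_version` and the directory argument are hypothetical; the actual esmf modulefile directory inside the stack is not given in this thread.

```shell
#!/usr/bin/env bash
# Print the version that a module directory's "default" symlink resolves to.
# Assumes the layout <moddir>/default -> <version>.lua (an assumption, see above).
default_module_version() {
  local moddir=$1 target
  # readlink fails if "default" is missing or is not a symlink.
  target=$(readlink "${moddir}/default") || return 1
  # Strip any leading path and the .lua suffix.
  basename "${target}" .lua
}
```

Called as `default_module_version <path-to-esmf-modulefile-dir>`, this should print the version name that a bare `module load esmf` would be expected to pick up under that layout.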

@zach1221
Collaborator Author

@natalie-perlin I'm still receiving the rt.sh (line 801) error about missing rt_*.log files with these two cases on Gaea. I'm troubleshooting and digging up additional info.

@natalie-perlin
Collaborator

natalie-perlin commented Mar 29, 2023

@zach1221 - if you are testing only the build ("compile") but not the run, you may not have any rt_*.log files, which are created after the "run" phase. Adding a conditional check for whether such files are present could help avoid the error.
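
A sketch of such a conditional check, not the actual code at rt.sh line 801 (which is not shown in this thread). The function names `have_rt_logs` and `collect_rt_logs` and the directory argument are hypothetical.

```shell
#!/usr/bin/env bash
# Return 0 if at least one rt_*.log file exists in the given directory.
have_rt_logs() {
  local dir=$1 f
  for f in "${dir}"/rt_*.log; do
    # An unmatched glob stays literal, so test that the path really exists.
    [ -e "${f}" ] && return 0
  done
  return 1
}

# Concatenate the run logs only when there are any to read.
collect_rt_logs() {
  local dir=$1
  if have_rt_logs "${dir}"; then
    cat "${dir}"/rt_*.log
  else
    echo "no rt_*.log files in ${dir}; skipping" >&2
  fi
}
```

With a guard like this, a compile-only pass would print a note instead of failing on the missing files.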

Your log file /lustre/f2/pdata/ncep/Zachary.Shrader/ufs-weather-model/tests/log_gaea.intel/compile_001.log reports test completed:
16 min. TEST 001 compile is COMPLETED, status: - jobid 75128296

@zach1221
Collaborator Author

@natalie-perlin I'm trying to run the case as well, after compiling, but it gets stuck after the compile completes, with the rt_*.log error. I'll investigate further and update when I've found the cause.

@zach1221
Collaborator Author

zach1221 commented Apr 3, 2023

Apologies for the delay, @natalie-perlin. Tests cpld_control_ciceC_p8 & cpld_control_c192_p8 worked for me on Gaea using pio/2.5.10 and esmf/8.4.1 from the installation @DusanJovic-NOAA mentioned above. Maybe this is the direction we should go in for updating esmf/pio on Gaea? I couldn't get the standard/current versions, pio/2.5.7 & esmf/8.3.0b09, to work.

@natalie-perlin
Collaborator

natalie-perlin commented Apr 4, 2023

@zach1221 @DusanJovic-NOAA
A new build of the hpc-stack on Gaea in EPIC location is available:
/lustre/f2/dev/role.epic/contrib/hpc-stack/intel-2021.3.0_ncdf492/
It includes hdf5/1.14.0 + netcdf/4.9.2 + pio/2.5.10 + esmf/8.4.1 + mapl/2.35.2

@DusanJovic-NOAA - does your stack build use netcdf/4.9.1 or netcdf/4.9.2?

@DusanJovic-NOAA
Collaborator

netcdf/4.7.4

The install directory is here: /lustre/f2/dev/Dusan.Jovic/hpc-stack/opt_intel_esmf_841/modulefiles

@zach1221
Collaborator Author

zach1221 commented Apr 6, 2023

Hi @natalie-perlin, I attempted cpld_control_ciceC_p8 using your latest installation of hpc-stack on Gaea, which includes hdf5/1.14.0 + netcdf/4.9.2 + pio/2.5.10 + esmf/8.4.1 + mapl/2.35.2. The compile is successful, but the test run fails. Could the issue be related to the mapl version?
(screenshot)

@zach1221
Collaborator Author

zach1221 commented Apr 6, 2023

I did test again with the combination of hdf5/1.10.6 + netcdf/4.7.4 +pio/2.5.10 + esmf/8.4.1 + mapl/2.22.0, and the cases cpld_control_ciceC_p8 and cpld_control_c192_p8 pass fine. Could we use this configuration on Gaea currently or does hdf5, netcdf and mapl also need to be updated with esmf/pio?

@DusanJovic-NOAA
Collaborator

The gocart failure looks similar to #1629

@zach1221
Collaborator Author

zach1221 commented Apr 7, 2023

Thanks, @DusanJovic-NOAA. It does look similar, and based on what I'm reading in issue 1621, there may be an outstanding problem.

@natalie-perlin There could currently be a GOCART-related issue when running some tests with the library set (hdf5/1.14.0 + netcdf/4.9.2 + pio/2.5.10 + esmf/8.4.1 + mapl/2.35.2). This may also be a reason to go with the alternative (hdf5/1.10.6 + netcdf/4.7.4 + pio/2.5.10 + esmf/8.4.1 + mapl/2.22.0).

@natalie-perlin
Collaborator

@zach1221 @DusanJovic-NOAA @jkbk2004 -
Preparing an additional configuration on Gaea as suggested by Zach (hdf5/1.10.6 + netcdf/4.7.4 +pio/2.5.10 + esmf/8.4.1 + mapl/2.22.0-esmf-8.4.1)...

@natalie-perlin
Collaborator

@zach1221 - The configuration is ready on Gaea in the stack:
/lustre/f2/dev/role.epic/contrib/hpc-stack/intel-2021.3.0_noarch/

To verify that the modules load:

source /lustre/f2/dev/role.epic/contrib/Lmod_init.sh
module use /lustre/f2/dev/role.epic/contrib/hpc-stack/intel-2021.3.0_noarch/modulefiles/stack/
module load hpc/1.2.0
module load hpc-intel
module load hpc-cray-mpich/7.7.11
module load hdf5/1.10.6
module load netcdf/4.7.4
module load pio/2.5.10
module load esmf/8.4.1
module load mapl/2.22.0-esmf-8.4.1
module list 

Currently Loaded Modules:
  1) modules/3.2.11.4
  2) CmrsEnv
  3) TimeZoneEDT
...
 25) hpc/1.2.0
 26) intel/2021.3.0
 27) hpc-intel/2021.3.0
 28) cray-mpich/7.7.11
 29) hpc-cray-mpich/7.7.11
 30) hdf5/1.10.6
 31) netcdf/4.7.4
 32) pio/2.5.10
 33) esmf/8.4.1
 34) mapl/2.22.0-esmf-8.4.1
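
The same verification can be scripted. Below is a rough sketch that takes the text of a module listing and reports any expected module that is absent; the function name `missing_modules` is hypothetical. It matches whole whitespace-separated tokens, so `pio/2.5.1` is not mistaken for `pio/2.5.10`.

```shell
#!/usr/bin/env bash
# Usage: missing_modules "<module list text>" mod1 mod2 ...
# Prints each expected module that is absent; returns 1 if any are missing.
missing_modules() {
  local listing=$1 m rc=0
  shift
  for m in "$@"; do
    # Put every token on its own line, then require an exact fixed-string match.
    if ! printf '%s\n' "${listing}" | tr -s '[:space:]' '\n' | grep -qxF -- "${m}"; then
      echo "${m}"
      rc=1
    fi
  done
  return "${rc}"
}
```

On a system with Lmod this might be called as `missing_modules "$(module list 2>&1)" pio/2.5.10 esmf/8.4.1 mapl/2.22.0-esmf-8.4.1`; the `2>&1` is there in case the listing is written to stderr rather than stdout.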


@zach1221
Collaborator Author

Thanks @natalie-perlin I will test this today!

@zach1221
Collaborator Author

@natalie-perlin I've added modules/3.2.11.4 to the modulefile for Gaea, but it seems Lmod is unable to locate it.
(screenshot)

Here's my modulefile setup:
ufs_gaea.intel.txt

@natalie-perlin
Collaborator

natalie-perlin commented Apr 12, 2023

@zach1221 -
There is no need to add this explicitly, as it is one of the system modules. All the system modules are loaded during Lmod initialization when the command source /lustre/f2/dev/role.epic/contrib/Lmod_init.sh in the first line of my code snippet is executed.

@zach1221
Collaborator Author

Hi, @natalie-perlin

I've removed the "modules/3.2.11.4" module from the Gaea module file. My next attempt failed because it was unable to load esmf/8.4.1.
(screenshot)

Steps to reproduce.

  1. Clone the ufs-wm community:dev repo.
  2. cd ufs-weather-model/modulefiles
  3. Edit ufs_common.lua to change the versions of mapl to "mapl/2.22.0-esmf-8.4.1", pio to "pio/2.5.10", and esmf to "esmf/8.4.1".
  4. Edit ufs_gaea.intel.lua to add the module path "/lustre/f2/dev/role.epic/contrib/hpc-stack/intel-2021.3.0_noarch/modulefiles/stack".
  5. cd ufs-weather-model/tests
  6. Enable tests cpld_control_ciceC_p8 & cpld_control_c192_p8 to run on Gaea by editing rt.conf.

@natalie-perlin
Collaborator

natalie-perlin commented Apr 14, 2023

@zach1221

Please verify that the Lmod initialization is run on Gaea before the modules are loaded. After your steps 1-3, you could test that the modules load properly as follows:

export MACHINE_ID=gaea
source tests/module-setup.sh
module use modulefiles
module load ufs_gaea.intel

You could also build the code needed for the cpld_control_ciceC_p8 test, which builds with no issues:

export CMAKE_FLAGS="-DAPP=S2SWA -DCCPP_SUITES=FV3_GFS_v17_coupled_p8,FV3_GFS_cpld_rasmgshocnsstnoahmp_ugwp"
./build.sh

It prints some warnings at the end of the build, but it builds the executable:

ld: /lustre/f2/dev/role.epic/contrib/hpc-stack/intel-2021.3.0_noarch/intel-2021.3.0/cray-mpich-7.7.11/esmf/8.4.1/lib/libesmf.a(ESMCI_MethodTable.o): in function `ESMCI::MethodElement::resolve()':
/lustre/f2/dev/role.epic/contrib/hpc-stack/src-intel-2021.3.0_noarch/pkg/v8.4.1/src/Superstructure/Component/src/ESMCI_MethodTable.C:400: warning: Using 'dlopen' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking
ld: /lustre/f2/dev/role.epic/contrib/hpc-stack/intel-2021.3.0_noarch/intel-2021.3.0/cray-mpich-7.7.11/esmf/8.4.1/lib/libesmf.a(ESMCI_VMKernel.o): in function `ESMCI::socketClientInit(char const*, int, double)':
/lustre/f2/dev/role.epic/contrib/hpc-stack/src-intel-2021.3.0_noarch/pkg/v8.4.1/src/Infrastructure/VM/src/ESMCI_VMKernel.C:7785: warning: Using 'gethostbyname' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking
[100%] Built target ufs_model
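
To confirm from a saved build log that these were only warnings, something like the following sketch could be used. It assumes the build output was captured to a file; the function name `summarize_build_log` is hypothetical, and the two search strings are taken from the output above.

```shell
#!/usr/bin/env bash
# Summarize a saved build log: count the glibc static-link warnings and
# report whether the ufs_model target was built.
summarize_build_log() {
  local log=$1 warnings
  # grep -c prints 0 (and exits non-zero) when nothing matches; only the count is used.
  warnings=$(grep -c "in statically linked applications" "${log}")
  echo "static-link warnings: ${warnings}"
  if grep -q "Built target ufs_model" "${log}"; then
    echo "ufs_model: built"
  else
    echo "ufs_model: NOT built"
    return 1
  fi
}
```

For the output quoted above, this would be expected to report two warnings and a built target.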



@natalie-perlin
Collaborator

@zach1221 for testing purposes on Gaea, instead of sourcing module-setup.sh, you could just source the Lmod initialization script and then load ufs_gaea.intel:

source /lustre/f2/dev/role.epic/contrib/Lmod_init.sh
module use <ufs-weather-model>/modulefiles
module load ufs_gaea.intel

@natalie-perlin
Collaborator

I started an rt.sh job as well; the compilation finished successfully, see
/lustre/f2/dev/role.epic/sandbox/UFS-WM/ufs-wm-dev/tests/log_gaea.intel/compile_001.log

@zach1221
Collaborator Author

@natalie-perlin I've got it working now. I'll update you here as soon as the tests pass.

@zach1221
Collaborator Author

@natalie-perlin the latest stack installation at /lustre/f2/dev/role.epic/contrib/hpc-stack/intel-2021.3.0_noarch/ seems to have resolved the issue with cpld_control_ciceC_p8 & cpld_control_c192_p8 on Gaea. fyi @jkbk2004

@jiandewang
Collaborator

@natalie-perlin thanks for the information, let me try again when GAEA is back

@natalie-perlin
Collaborator

@zach1221 -
Please update the status for the tests on Gaea
A note regarding WM Issue-1724 (#1724): it is a separate issue, and requires new stacks to be built with the compilers available on both C3 and C4 partitions.

@jkbk2004
Collaborator

jkbk2004 commented May 2, 2023

@natalie-perlin is intel/2022.2.1 available on both C4 and C5?

@zach1221
Collaborator Author

zach1221 commented May 2, 2023

@jkbk2004 @natalie-perlin cpld_control_ciceC_p8 & cpld_control_c192_p8 fail on c4 due to the inability to load intel/2021.3.0. I'm testing on c3 now.

@zach1221
Collaborator Author

zach1221 commented May 2, 2023

Does the c3 partition not have the resources to run these tests? I receive the error below when attempting cpld_control_ciceC_p8 & cpld_control_c192_p8 on c3.
(screenshot)

@jkbk2004
Collaborator

jkbk2004 commented May 2, 2023

C3 has a different architecture.

@jkbk2004
Collaborator

jkbk2004 commented May 2, 2023

It sounds like intel-2022.0.2/classic/oneapi will be the most practical option.

@natalie-perlin
Collaborator

@natalie-perlin is intel/2022.2.1 available on both of C4 and C5?

C4 (gaea13 check): intel-classic/2022.2.1
C5 (gaea55 check): intel-classic/2022.2.1

Yes, same name on both C4 and C5 partitions

@zach1221
Collaborator Author

OK, I can confirm that cpld_control_ciceC_p8 now passes when testing on ufs-community:develop. However, cpld_control_c192_p8 still fails with the error below. Error log: /lustre/f2/scratch/Zachary.Shrader/FV3_RT/rt_20605/cpld_control_c192_p8
(screenshot)

@zach1221
Collaborator Author

Both cpld_control_ciceC_p8 and cpld_control_c192_p8 run successfully now on Gaea. Logs: /lustre/f2/pdata/ncep/Zachary.Shrader/ufs-weather-model/tests/logs/RegressionTests_gaea.log

Gaea has been re-enabled for these two tests in ufs-wm #1912 .
