
Updating hpc-stack modules and miniconda locations for Hera, Gaea, Cheyenne, Orion, Jet #1465

Closed
natalie-perlin opened this issue Oct 19, 2022 · 42 comments · Fixed by #1596
Labels: enhancement (New feature or request)

Comments

@natalie-perlin (Collaborator) commented Oct 19, 2022

Description

Update the locations of the hpc-stack modules and miniconda3 used for compiling and running the UFS weather model on NOAA HPC systems such as Hera, Gaea, Cheyenne, Orion, and Jet. The modules are installed under the role.epic account and placed in a common EPIC-managed space on each system. Gaea also uses an Lmod installed locally in the same common location (ufs-srweather-app/PR-352, ufs-weather-app/PR-353) and needs to run a script to initialize Lmod before loading the modulefile ufs_gaea.intel.lua. While the ufs-weather-model may use/require Python to a lesser extent, the UFS-srweather-app relies heavily on a conda environment.

For ease of maintenance of the libraries on the NOAA HPC systems, a transition to the new location of the modules built for both the ufs-weather-model and the ufs-srweather-app is needed.

Solution

The ufs-weather-model repo is to be updated with the new versions of miniconda and the hpc libraries.

Updated installation locations have been used to load the modules listed in /ufs-weather-model/modulefiles/ufs_common and to build the ufs model binaries.

UPD. 10/20/2022: Modules for Hera and Jet have been built for the already-tested compiler intel/2022.1.2. Modules for the compiler/impi intel/2022.2.0 also remain and can be used when an upgrade is needed.

UPD. 10/24/2022: Modules for the Hera gnu compilers (9.2.0, 10.2.0), with different mpich/openmpi combinations and an updated netcdf/4.9.0, have been prepared.
UPD. 12/07/2022: Added a gnu/10.1.0-based hpc-stack on Cheyenne, by request.
UPD. 12/07/2022: Added a gnu/10.1.0-based hpc-stack on Cheyenne with mpt/2.22, by request.

Cheyenne Lmod was upgraded to v8.7.13 system-wide after the system maintenance on 10/21/2022.

Alternatives

An alternative solution would be to have the hpc libraries and modules built in separate locations for the ufs-weather-model and the ufs-srweather-app. The request from EPIC management, however, was to use a common location for all the libraries.

Related to

PR #419 in the ufs-srweather-app already exists, and a new PR will be made to the current repo.

Updated locations of the conda/python and hpc-stack modules, and how to load them on each system:

Hera python/miniconda :
module use /scratch1/NCEPDEV/nems/role.epic/miniconda3/modulefiles
module load miniconda3/4.12.0

Hera intel/2022.1.2 + impi/2022.1.2 :
module load intel/2022.1.2
module load impi/2022.1.2
module use /scratch1/NCEPDEV/nems/role.epic/hpc-stack/libs/intel-2022.1.2/modulefiles/stack
module load hpc/1.2.0
module load hpc-intel/2022.1.2
module load hpc-impi/2022.1.2

Hera intel/2022.1.2 + impi/2022.1.2 + netcdf-c 4.9.0:
module load intel/2022.1.2
module load impi/2022.1.2
module use /scratch1/NCEPDEV/nems/role.epic/hpc-stack/libs/intel-2022.1.2_ncdf49/modulefiles/stack
module load hpc/1.2.0
module load hpc-intel/2022.1.2
module load hpc-impi/2022.1.2

Hera gnu/9.2 + mpich/3.3.2 :
module load gnu/9.2
module use /scratch1/NCEPDEV/nems/role.epic/hpc-stack/libs/gnu-9.2/modulefiles/stack
module load hpc/1.2.0
module load hpc-gnu/9.2
module load mpich/3.3.2
module load hpc-mpich/3.3.2

Hera gnu/10.2 + mpich/3.3.2 :
module use /scratch1/NCEPDEV/nems/role.epic/gnu/modulefiles
module load gnu/10.2.0
module use /scratch1/NCEPDEV/nems/role.epic/hpc-stack/libs/gnu-10.2/modulefiles/stack
module load hpc/1.2.0
module load hpc-gnu/10.2
module load mpich/3.3.2
module load hpc-mpich/3.3.2

Hera gnu/10.2 + openmpi/4.1.2 :
module use /scratch1/NCEPDEV/nems/role.epic/gnu/modulefiles
module load gnu/10.2.0
module use /scratch1/NCEPDEV/nems/role.epic/hpc-stack/libs/gnu-10.2_openmpi/modulefiles/stack
module load hpc/1.2.0
module load hpc-gnu/10.2
module load openmpi/4.1.2
module load hpc-openmpi/4.1.2

Hera gnu/9.2 + mpich/3.3.2 + netcdf-c 4.9.0:
module load gnu/9.2
module use /scratch1/NCEPDEV/nems/role.epic/hpc-stack/libs/gnu-9.2_ncdf49/modulefiles/stack
module load hpc/1.2.0
module load hpc-gnu/9.2
module load mpich/3.3.2
module load hpc-mpich/3.3.2

Hera gnu/10.2 + mpich/3.3.2 + netcdf-c/4.9.0:
module use /scratch1/NCEPDEV/nems/role.epic/gnu/modulefiles
module load gnu/10.2.0
module use /scratch1/NCEPDEV/nems/role.epic/hpc-stack/libs/gnu-10.2_ncdf49/modulefiles/stack
module load hpc/1.2.0
module load hpc-gnu/10.2
module load mpich/3.3.2
module load hpc-mpich/3.3.2

Gaea miniconda:
module use /lustre/f2/dev/role.epic/contrib/modulefiles
module load miniconda3/4.12.0

Gaea intel:
Lmod initialization on Gaea needs to be done first by sourcing the following script:
/lustre/f2/dev/role.epic/contrib/Lmod_init.sh
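
For reference, the initialization amounts to one command before any module commands are issued (a minimal sketch, assuming a bash-compatible login shell):
source /lustre/f2/dev/role.epic/contrib/Lmod_init.sh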

module use /lustre/f2/dev/role.epic/contrib/modulefiles
module load miniconda3/4.12.0

module use /lustre/f2/dev/role.epic/contrib/hpc-stack/intel-2021.3.0/modulefiles/stack
module load hpc/1.2.0
module load intel/2021.3.0
module load hpc-intel/2021.3.0
module load hpc-cray-mpich/7.7.11

Cheyenne miniconda:
module use /glade/work/epicufsrt/contrib/miniconda3/modulefiles
module load miniconda3/4.12.0

Cheyenne intel:
module use /glade/work/epicufsrt/contrib/miniconda3/modulefiles
module load miniconda3/4.12.0

module use /glade/work/epicufsrt/contrib/hpc-stack/intel2022.1/modulefiles/stack
module load hpc/1.2.0
module load hpc-intel/2022.1
module load hpc-mpt/2.25

Cheyenne gnu/10.1.0_mpt2.22:
module use /glade/work/epicufsrt/contrib/hpc-stack/gnu10.1.0_mpt2.22/modulefiles/stack
module load hpc/1.2.0
module load hpc-gnu/10.1.0
module load hpc-mpt/2.22

Cheyenne gnu/10.1.0:
module use /glade/work/epicufsrt/contrib/hpc-stack/gnu10.1.0/modulefiles/stack
module load hpc/1.2.0
module load hpc-gnu/10.1.0
module load hpc-mpt/2.25

Cheyenne gnu/11.2.0:
module use /glade/work/epicufsrt/contrib/hpc-stack/gnu11.2.0/modulefiles/stack
module load hpc/1.2.0
module load hpc-gnu/11.2.0
module load hpc-mpt/2.25

Orion miniconda:
module use /work/noaa/epic-ps/role-epic-ps/miniconda3/modulefiles
module load miniconda3/4.12.0

Orion intel:
module use /work/noaa/epic-ps/role-epic-ps/miniconda3/modulefiles
module load miniconda3/4.12.0

module use /work/noaa/epic-ps/role-epic-ps/hpc-stack/libs/intel-2022.1.2/modulefiles/stack
module load hpc/1.2.0
module load hpc-intel/2022.1.2
module load hpc-impi/2022.1.2

Jet miniconda:
module use /mnt/lfs4/HFIP/hfv3gfs/role.epic/miniconda3/modulefiles
module load miniconda3/4.12.0

Jet intel:
module use /mnt/lfs4/HFIP/hfv3gfs/role.epic/miniconda3/modulefiles
module load miniconda3/4.12.0

module use /mnt/lfs4/HFIP/hfv3gfs/role.epic/hpc-stack/libs/intel-2022.1.2/modulefiles/stack
module load hpc/1.2.0
module load hpc-intel/2022.1.2
module load hpc-impi/2022.1.2
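
After loading any of the stacks listed above, a quick sanity check can confirm that the expected compiler and MPI wrappers are in the environment (a sketch only; the exact output and compiler command depend on the system and the stack chosen):

module list        # should include hpc/1.2.0 plus the matching hpc-<compiler> and hpc-<mpi> modules
gcc --version      # or icc --version / ifort --version for the intel-based stacks
which mpif90       # the MPI compiler wrapper should resolve under the loaded MPI installation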

NB: There were comments in ufs-srweather-app PR #419 suggesting rolling back to lower compiler versions for Cheyenne gnu (to use 11.2.0 instead of 12.1.0), Hera intel (to use intel/2021.1.2 instead of 2022.2.0), and Jet intel (to use intel/2021.1.2 instead of intel/2022.2.0).

Either way could be OK for the SRW, and the libraries would be built for the lower-version compilers as suggested.

natalie-perlin added the enhancement (New feature or request) label on Oct 19, 2022
@jkbk2004 (Collaborator)

@natalie-perlin Can you make sure all compiler and library versions are confirmed against https://github.com/ufs-community/ufs-weather-model/tree/develop/modulefiles ?

@jkbk2004 (Collaborator)

@ulmononian can we coordinate on intel/gnu/openmpi for Hera in this issue?

@natalie-perlin (Collaborator, Author)

@jkbk2004 The PRs have not been made yet to address the changes in modulefiles for the ufs-weather-model, only for the ufs-srweather-app

@natalie-perlin (Collaborator, Author)

The modulefiles for Hera and Jet have been built to use the intel/2022.1.2 version rather than the latest 2022.2.0. Updating the info in the top comment of this issue.

@DusanJovic-NOAA (Collaborator)

Can somebody please build the gnu hpc-stack on Hera and Cheyenne using openmpi? Thanks.

@ulmononian (Collaborator) commented Oct 20, 2022

@DusanJovic-NOAA @jkbk2004 here is a build i did in the past w/ gnu-9.2.0 & openmpi-3.1.4 on hera: module use /scratch1/NCEPDEV/stmp2/Cameron.Book/hpcs_work/libs/gnu/stack_noaa/modulefiles/stack

@DusanJovic-NOAA (Collaborator)

> @DusanJovic-NOAA @jkbk2004 here is a build i did in the past w/ gnu-9.2.0 & openmpi-3.1.4 on hera: module use /scratch1/NCEPDEV/stmp2/Cameron.Book/hpcs_work/libs/gnu/stack_noaa/modulefiles/stack

Thanks @ulmononian. I also have the gnu/openmpi stack built in my own space. What I was asking is the installation in officially supported location so that we can update modulefiles in develop branch.

@junwang-noaa (Collaborator)

@ulmononian would you please also create an hpc-stack issue on the UPP repo (https://github.com/noaa-emc/upp)? Other workflows (global workflow, HAFS workflow) may also be impacted by this change. @WenMeng-NOAA @aerorahul @WalterKolczynski-NOAA @KateFriedman-NOAA @BinLiu-NOAA FYI.

@jkbk2004 (Collaborator)

@junwang-noaa @ulmononian @WenMeng-NOAA @aerorahul @WalterKolczynski-NOAA @KateFriedman-NOAA @BinLiu-NOAA @natalie-perlin I noticed that Kyle's old stack installations are still used in other applications and on some machines. I have started coordination on the EPIC side. It may take a week or two to finish the full transition. I want to combine this issue with the other ongoing library-update follow-ups: netcdf/esmf, etc.

@WenMeng-NOAA (Contributor)

@jkbk2004 Can you install g2tmpl/1.10.2 for the UPP? Thanks!

@jkbk2004 (Collaborator)

> @jkbk2004 Can you install g2tmpl/1.10.2 for the UPP? Thanks!

@WenMeng-NOAA g2tmpl/1.10.2 is available (current ufs-wm modulefiles), but a backward-compatibility issue was captured in issue #1441.

@natalie-perlin (Collaborator, Author)

@DusanJovic-NOAA - hpc-stack builds with gnu/9.2.0 + mpich/3.3.2 and gnu/10.2.0 + mpich/3.3.2 have been installed on Hera under the role.epic account (EPIC-managed space). I am testing them with the ufs-weather-model RTs and plan to include these Hera gnu stacks in the module updates.

The stack installation locations are:
/scratch1/NCEPDEV/nems/role.epic/hpc-stack/libs/gnu-9.2/
/scratch1/NCEPDEV/nems/role.epic/hpc-stack/libs/gnu-10.2/

Exact modifications to the modulefiles (the paths needed for finding all the modules) will be listed in a subsequent PR (or PRs).

@DusanJovic-NOAA (Collaborator)

> @DusanJovic-NOAA - hpc-stack builds with gnu/9.2.0 + mpich/3.3.2 and gnu/10.2.0 + mpich/3.3.2 have been installed on Hera under the role.epic account (EPIC-managed space). I am testing them with the ufs-weather-model RTs and plan to include these Hera gnu stacks in the module updates.
>
> The stack installation locations are: /scratch1/NCEPDEV/nems/role.epic/hpc-stack/libs/gnu-9.2/ /scratch1/NCEPDEV/nems/role.epic/hpc-stack/libs/gnu-10.2/
>
> Exact modifications to the modulefiles (the paths needed for finding all the modules) will be listed in a subsequent PR (or PRs).

@natalie-perlin Is anyone going to provide gnu/openmpi stack?

@jkbk2004 (Collaborator)

> @DusanJovic-NOAA - hpc-stack builds with gnu/9.2.0 + mpich/3.3.2 and gnu/10.2.0 + mpich/3.3.2 have been installed on Hera under the role.epic account (EPIC-managed space). I am testing them with the ufs-weather-model RTs and plan to include these Hera gnu stacks in the module updates.
> The stack installation locations are: /scratch1/NCEPDEV/nems/role.epic/hpc-stack/libs/gnu-9.2/ /scratch1/NCEPDEV/nems/role.epic/hpc-stack/libs/gnu-10.2/
> Exact modifications to the modulefiles (the paths needed for finding all the modules) will be listed in a subsequent PR (or PRs).
>
> @natalie-perlin Is anyone going to provide gnu/openmpi stack?

@ulmononian can you install gnu/openmpi alongside the location above?

@natalie-perlin (Collaborator, Author)

@jkbk2004 - do we need all four possible combinations of the compilers (gnu/9.2.0, gnu/10.2.0) with mpich/3.3.2 and openmpi/4.1.2?

@jkbk2004 (Collaborator)

> @jkbk2004 - do we need all four possible combinations of the compilers (gnu/9.2.0, gnu/10.2.0) with mpich/3.3.2 and openmpi/4.1.2?

@natalie-perlin I think @ulmononian has installed gnu10.1/openmpi. That should be good enough as a starting point for the openmpi option. But it makes sense to make the openmpi installation available under the role-account path as well.

@natalie-perlin (Collaborator, Author)

@jkbk2004, @ulmononian -
HPC modules using different versions of gnu, mpich, and openmpi were installed, plus a new netcdf 4.9.0 (netcdf-c/4.9.0, netcdf-fortran/4.6.0, netcdf-cxx-4.3.1), for the following combinations:

gnu/9.2.0 + mpich/3.3.2 + netcdf/4.7.4
gnu/9.2.0 + mpich/3.3.2 + netcdf/4.9.0
gnu/10.2.0 + mpich/3.3.2 + netcdf/4.7.4
gnu/10.2.0 + mpich/3.3.2 + netcdf/4.9.0
gnu/10.2.0 + openmpi/4.1.2 + netcdf/4.7.4

The stack locations have been updated in the top comment of this issue (#1465).

@natalie-perlin (Collaborator, Author)

Added a stack build with the intel compiler and netcdf 4.9 on Hera (see the list of locations in the top comment).

@ulmononian (Collaborator) commented Oct 27, 2022

@DusanJovic-NOAA @jkbk2004 @natalie-perlin i will install the stack w/ gnu-9.2 and openmpi-3.1.4 here /scratch1/NCEPDEV/nems/role.epic/hpc-stack/libs shortly, as well as w/ gnu-10.1 & openmpi-3.1.4 in the official location.

@ulmononian (Collaborator)

@DusanJovic-NOAA @jkbk2004 @natalie-perlin hpc-stack built w/ gnu-9.2 and openmpi-3.1.4 was installed successfully here: /scratch1/NCEPDEV/nems/role.epic/hpc-stack/libs/gnu-9.2_openmpi-3.1.4.

@DusanJovic-NOAA (Collaborator)

I tried running the regression test using gnu-9.2_openmpi-3.1.4 stack but it failed because the debug version of esmf library is missing:

$ module load ufs_hera.gnu_debug
Lmod has detected the following error:  The following module(s) are
unknown: "esmf/8.3.0b09-debug"

Please check the spelling or version number. Also try "module spider ..."
It is also possible your cache file is out-of-date; it may help to try:
  $ module --ignore_cache load "esmf/8.3.0b09-debug"

$ ls -l /scratch1/NCEPDEV/nems/role.epic/hpc-stack/libs/gnu-9.2_openmpi-3.1.4/modulefiles/mpi/gnu/9.2.0/openmpi/3.1.4/esmf/
total 4
-rw-r--r-- 1 role.epic nems 1365 Oct 28 23:20 8.3.0b09.lua
lrwxrwxrwx 1 role.epic nems   12 Oct 28 23:20 default -> 8.3.0b09.lua

@DusanJovic-NOAA (Collaborator)

I also tried the 'gnu-10.2_openmpi' stack, but it looks like when I load it, it does not actually load the gnu 10.2 module. I see:

$ module list

Currently Loaded Modules:
  1) miniconda3/3.7.3   10) libpng/1.6.37  19) g2tmpl/1.10.0
  2) sutils/default     11) hdf5/1.10.6    20) ip/3.3.3
  3) cmake/3.20.1       12) netcdf/4.7.4   21) sp/2.3.3
  4) hpc/1.2.0          13) pio/2.5.7      22) w3emc/2.9.2
  5) hpc-gnu/10.2       14) esmf/8.3.0b09  23) gftl-shared/v1.5.0
  6) openmpi/4.1.2      15) fms/2022.01    24) mapl/2.22.0-esmf-8.3.0b09
  7) hpc-openmpi/4.1.2  16) bacio/2.4.1    25) ufs_common
  8) jasper/2.0.25      17) crtm/2.4.0     26) ufs_hera.gnu
  9) zlib/1.2.11        18) g2/3.4.5

Note that there is no gnu/10.2 module loaded. When I run gcc, I see the compiler is version 4.8.5:

$ gcc --version
gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44)
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

I think this is because, in gnu-10.2_openmpi/modulefiles/core/hpc-gnu/10.2.lua, two lines:

load(compiler)
prereq(compiler)

are missing:

$ cat gnu-10.2_openmpi/modulefiles/core/hpc-gnu/10.2.lua 

...
local compiler = pathJoin("gnu",pkgVersion)

local opt = os.getenv("HPC_OPT") or os.getenv("OPT") or "/opt/modules"
local mpath = pathJoin(opt,"modulefiles/compiler","gnu",pkgVersion)
prepend_path("MODULEPATH", mpath)
...

which are present in:

$ cat gnu-9.2_openmpi-3.1.4/modulefiles/core/hpc-gnu/9.2.0.lua 

...
local compiler = pathJoin("gnu",pkgVersion)
load(compiler)
prereq(compiler)

local opt = os.getenv("HPC_OPT") or os.getenv("OPT") or "/opt/modules"
local mpath = pathJoin(opt,"modulefiles/compiler","gnu",pkgVersion)
prepend_path("MODULEPATH", mpath)
...
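
A quick way to confirm which of the two hpc-gnu modulefiles carries the load(compiler)/prereq(compiler) lines is a grep over both files (a sketch only, assuming it is run from the /scratch1/NCEPDEV/nems/role.epic/hpc-stack/libs directory shown above):

# lines found only in the 9.2.0 modulefile indicate the missing compiler load/prereq in 10.2.lua
grep -n 'load(compiler)\|prereq(compiler)' \
    gnu-10.2_openmpi/modulefiles/core/hpc-gnu/10.2.lua \
    gnu-9.2_openmpi-3.1.4/modulefiles/core/hpc-gnu/9.2.0.lua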

@DusanJovic-NOAA (Collaborator)

There is also an unnecessary inconsistency in the naming of the hpc-gnu module between the two versions:

$ ll gnu-9.2_openmpi-3.1.4/modulefiles/core/hpc-gnu/
total 4
-rw-r--r-- 1 role.epic nems 749 Oct 28 22:07 9.2.0.lua
$ ll gnu-10.2_openmpi/modulefiles/core/hpc-gnu/
total 4
-rw-r--r-- 1 role.epic nems 717 Oct 24 12:59 10.2.lua

Why '10.2' and not '10.2.0'? Also, the 9.2 stack directory name includes the openmpi version, while the directory for the 10.2 stack does not.

@ulmononian (Collaborator)

> I tried running the regression test using gnu-9.2_openmpi-3.1.4 stack but it failed because the debug version of esmf library is missing:
>
> $ module load ufs_hera.gnu_debug
> Lmod has detected the following error:  The following module(s) are
> unknown: "esmf/8.3.0b09-debug"
>
> Please check the spelling or version number. Also try "module spider ..."
> It is also possible your cache file is out-of-date; it may help to try:
>   $ module --ignore_cache load "esmf/8.3.0b09-debug"
>
> $ ls -l /scratch1/NCEPDEV/nems/role.epic/hpc-stack/libs/gnu-9.2_openmpi-3.1.4/modulefiles/mpi/gnu/9.2.0/openmpi/3.1.4/esmf/
> total 4
> -rw-r--r-- 1 role.epic nems 1365 Oct 28 23:20 8.3.0b09.lua
> lrwxrwxrwx 1 role.epic nems   12 Oct 28 23:20 default -> 8.3.0b09.lua

my apologies, @DusanJovic-NOAA i will install esmf/8.3.0b09-debug in /scratch1/NCEPDEV/nems/role.epic/hpc-stack/libs/gnu-9.2_openmpi-3.1.4 now and update you when it is finished. we will also address the inconsistency in naming convention and look into the gnu-10.2 modulefile. thank you for testing w/ these stacks.

@ulmononian (Collaborator)

@DusanJovic-NOAA the stack at /scratch1/NCEPDEV/nems/role.epic/hpc-stack/libs/gnu-9.2_openmpi-3.1.4 has been updated to include esmf/8.3.0b09-debug. i was able to load ufs_common_debug.lua, so hopefully it works for you now!

@natalie-perlin (Collaborator, Author)

@DusanJovic-NOAA, @ulmononian - please note that GNU 10.2.0 is not installed system-wide on Hera; it is only installed locally in the EPIC space. It could be built under the current hpc-stack for a particular compiler-gnu-netcdf installation location, but because the compiler is shared between several such combinations, it has been moved to a common location outside any given hpc-stack installation.

Please note that the directions for loading the compilers and stack given in the first comment describe how the compiler is loaded! For example, Hera gnu/10.2 + mpich/3.3.2:
module use /scratch1/NCEPDEV/nems/role.epic/gnu/modulefiles
module load gnu/10.2.0

module use /scratch1/NCEPDEV/nems/role.epic/hpc-stack/libs/gnu-10.2/modulefiles/stack
module load hpc/1.2.0
module load hpc-gnu/10.2
module load mpich/3.3.2
module load hpc-mpich/3.3.2

@DusanJovic-NOAA (Collaborator)

> @DusanJovic-NOAA the stack at /scratch1/NCEPDEV/nems/role.epic/hpc-stack/libs/gnu-9.2_openmpi-3.1.4 has been updated to include esmf/8.3.0b09-debug. i was able to load ufs_common_debug.lua, so hopefully it works for you now!

@ulmononian Thanks for adding the debug build of esmf. I ran the control and control_debug regression tests; both finished successfully. The control test outputs are not bit-identical to the baseline, while the control_debug outputs are identical. I guess this is expected due to the different MPI library.

@DusanJovic-NOAA (Collaborator)

> @DusanJovic-NOAA, @ulmononian - please note that GNU 10.2.0 is not installed system-wide on Hera; it is only installed locally in the EPIC space. It could be built under the current hpc-stack for a particular compiler-gnu-netcdf installation location, but because the compiler is shared between several such combinations, it has been moved to a common location outside any given hpc-stack installation.
>
> Please note that the directions for loading the compilers and stack given in the first comment describe how the compiler is loaded! For example, Hera gnu/10.2 + mpich/3.3.2:
> module use /scratch1/NCEPDEV/nems/role.epic/gnu/modulefiles
> module load gnu/10.2.0
> module use /scratch1/NCEPDEV/nems/role.epic/hpc-stack/libs/gnu-10.2/modulefiles/stack
> module load hpc/1.2.0
> module load hpc-gnu/10.2
> module load mpich/3.3.2
> module load hpc-mpich/3.3.2

@natalie-perlin I tried to run the control and control_debug tests after loading the gnu module from the location above (thanks for explaining this, I missed that in the description). The control test compiled successfully, but failed at run time:

+ sleep 1                                                                                                                            
+ srun --label -n 160 ./fv3.exe                                                                                                      
  1: [h12c01:06674] OPAL ERROR: Unreachable in file ../../../../../opal/mca/pmix/pmix3x/pmix3x_client.c at line 112                  
 90: [h20c56:12037] OPAL ERROR: Unreachable in file ../../../../../opal/mca/pmix/pmix3x/pmix3x_client.c at line 112                  
 55: [h12c04:153910] OPAL ERROR: Unreachable in file ../../../../../opal/mca/pmix/pmix3x/pmix3x_client.c at line 112                 
144: [h21c53:84991] OPAL ERROR: Unreachable in file ../../../../../opal/mca/pmix/pmix3x/pmix3x_client.c at line 112                  
....
 38: [h12c01:06711] OPAL ERROR: Unreachable in file ../../../../../opal/mca/pmix/pmix3x/pmix3x_client.c at line 112                  
 43: --------------------------------------------------------------------------                                                      
 43: The application appears to have been direct launched using "srun",                                                              
 43: but OMPI was not built with SLURM's PMI support and therefore cannot                                                            
 43: execute. There are several options for building PMI support under                                                               
 43: SLURM, depending upon the SLURM version you are using:                                                                          
 43:                                                                                                                                 
 43:   version 16.05 or later: you can use SLURM's PMIx support. This                                                                
 43:   requires that you configure and build SLURM --with-pmix.                                                                      
 43:                                                                                                                                 
 43:   Versions earlier than 16.05: you must use either SLURM's PMI-1 or                                                             
 43:   PMI-2 support. SLURM builds PMI-1 by default, or you can manually                                                             
 43:   install PMI-2. You must then build Open MPI using --with-pmi pointing                                                         
 43:   to the SLURM PMI library location.                                                                                            
 43:                                                                                                                                 
 43: Please configure as appropriate and try again.                                                                                  
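
For reference, the two remediation routes described in that error message would look roughly as follows (a sketch only; whether this SLURM installation provides PMIx support, and the Open MPI and SLURM install prefixes, are assumptions rather than confirmed details of the Hera setup):

# Option 1: launch through SLURM's PMIx plugin, if the site SLURM was built with it
srun --mpi=pmix -n 160 ./fv3.exe

# Option 2: rebuild the Open MPI used by this stack against SLURM's PMI library
./configure --prefix=<openmpi-install-dir> --with-slurm --with-pmi=<slurm-prefix>
make install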

@DusanJovic-NOAA (Collaborator)

The debug version of esmf is missing in the gnu-10.2_openmpi stack:

$ ls -l /scratch1/NCEPDEV/nems/role.epic/hpc-stack/libs/gnu-10.2_openmpi/modulefiles/mpi/gnu/10.2/openmpi/4.1.2/esmf/
total 4
-rw-r--r-- 1 role.epic nems 1365 Oct 24 14:36 8.3.0b09.lua
lrwxrwxrwx 1 role.epic nems   12 Oct 24 14:36 default -> 8.3.0b09.lua

@MichaelLueken (Collaborator)

@natalie-perlin The SRW App was tested on Hera using the intel/2022.1.2 + impi/2022.1.2 + netcdf-c 4.9.0 stack. All fundamental WE2E tests ran successfully. Testing netcdf/4.9.0 with the /scratch2/NCEPDEV/nwprod/hpc-stack/libs/hpc-stack/modulefiles/stack location causes the SRW WE2E tests to fail while running the forecast, due to illegal characters in the NetCDF files.

It would be interesting to see the differences between the two stacks and see why one version works while the other doesn't.

@grantfirl (Collaborator)

I think that this is related to this issue: NCAR/ccpp-physics#980

@MichaelLueken (Collaborator)

Thanks, @grantfirl! Yes, I was seeing the same issue as described in NCAR/ccpp-physics#980. It is nice to see that this won't be an issue once the stack on Hera is transitioned to @natalie-perlin's new stack.

@zach1221 (Collaborator) commented Jan 5, 2023

Hi @MichaelLueken I've tested Natalie's instructions above for loading conda/python and hpc-modules on Hera, Gaea, Cheyenne, Orion and Jet. I did not have any issues.

@MichaelLueken (Collaborator)

Thanks, @zach1221! That's great news! Once the new hpc-stack locations for Gaea, Cheyenne, Orion, and Jet are updated in the weather model, @natalie-perlin will be able to update the locations in the SRW App.

@zach1221 (Collaborator) commented Jan 9, 2023

@natalie-perlin can crtm and gftl-shared be updated to crtm/2.4.0 and gftl-shared/v1.5.0 on Jet? Currently it seems your new module stack location has only crtm/2.3.0 and gftl-shared/1.3.3.

@natalie-perlin (Collaborator, Author)

@MichaelLueken @zach1221 -
all resolved for the intel/2022.1.2 on Jet!

@zach1221 (Collaborator)

Hi, @natalie-perlin
@MichaelLueken and I continue to have issues with the new stack on Jet. Have you had the chance to try running any regression tests or SRW yourself, using the new stack?

@natalie-perlin (Collaborator, Author)

@zach1221 @MichaelLueken -
Please remember to recompile/rebuild the SRW or UFS WM with the new stack.

Yes, I ran the SRW tests with the new stack on Jet.
The modulefiles, build directory, and SRW binaries:

/mnt/lfs4/HFIP/hfv3gfs/role.epic/sandbox/SRW/ufs-srw-hpc-noAVXs/modulefiles
/mnt/lfs4/HFIP/hfv3gfs/role.epic/sandbox/SRW/ufs-srw-hpc-noAVXs/build
/mnt/lfs4/HFIP/hfv3gfs/role.epic/sandbox/SRW/ufs-srw-hpc-noAVXs/exec

The four experiments:

/mnt/lfs4/HFIP/hfv3gfs/role.epic/sandbox/SRW/expt_dirs/grid_CONUScompact_25km_VJET_hpc_new/
/mnt/lfs4/HFIP/hfv3gfs/role.epic/sandbox/SRW/expt_dirs/grid_CONUScompact_25km_KJET_hpc_new/
/mnt/lfs4/HFIP/hfv3gfs/role.epic/sandbox/SRW/expt_dirs/grid_CONUScompact_25km_SJET_hpc_new/
/mnt/lfs4/HFIP/hfv3gfs/role.epic/sandbox/SRW/expt_dirs/grid_CONUScompact_25km_XJET_hpc_new/

@MichaelLueken (Collaborator)

@natalie-perlin Thanks! I was able to successfully build and run the SRW App's fundamental WE2E tests on Jet using the new HPC-stack location (the run_fcst job even ran using vjet, which would have led to the job failing previously).

@natalie-perlin (Collaborator, Author)

@jkbk2004 - Gaea modules were not updated
