
Hpc stack update (Issue 1465) #1596

Merged: 80 commits into ufs-community:develop on Feb 9, 2023

@zach1221
Collaborator

zach1221 commented Feb 2, 2023

Description

Adds updated module paths to the modulefiles for the new HPC-Stack locations created on tier-1 on-prem HPCs.
Testing of the new HPC-Stack locations completed successfully on Jet, Orion, Hera (intel/gnu), and Cheyenne (intel/gnu). Also updates the autoRT baseline capability and the ORT scripts.

Related issue: HPC-Stack updates #1465

New features/additions being added through this PR.
New HPC-Stack module locations added to modulefiles for the below HPCs.

  • Jet intel and intel debug
  • Hera intel and intel debug
  • Hera gnu and gnu debug
  • Cheyenne intel and intel debug
  • Cheyenne gnu and gnu debug
  • Orion intel and intel debug

Updates to the below files for autoRT baseline capability. (Adding new workdir and blstore paths)

  • bl.py

Updates to the below files to correct issues with ORT jenkins-ci pipeline. (Update ORT scripts to correct logic related to TASKS variables and cmp_proc_bind.sh)

  • wrt_env.sh, dcp.sh, mpi.sh, rst.sh, std.sh, thr.sh
  • opnReqTest, run_test.sh
  • jenkinsfile

Top of commit queue on: TBD

Input data additions/changes

No changes are expected to input data.

Anticipated changes to regression tests:

No Baseline Change

Subcomponents involved:

  • AQM
  • CDEPS
  • CICE
  • CMEPS
  • CMakeModules
  • FV3
  • GOCART
  • HYCOM
  • MOM6
  • NOAHMP
  • WW3
  • stochastic_physics
  • none

Combined with PR's (If Applicable):

Commit Queue Checklist:

  • Link PR's from all sub-components involved
  • Confirm reviews completed in sub-component PR's
  • Add all appropriate labels to this PR.
  • Run full RT suite on either Hera/Cheyenne with both Intel/GNU compilers
  • Add list of any failed regression tests to "Anticipated changes to regression tests" section.

Linked PR's and Issues:

Testing Day Checklist:

  • This PR is up-to-date with the top of all sub-component repositories except for those sub-components which are the subject of this PR.
  • Move new/updated input data on RDHPCS Hera and propagate input data changes to all supported systems.

Testing Log (for CM's):

  • RDHPCS
    • Intel
      • Hera
      • Orion
      • Jet
      • Gaea
      • Cheyenne
    • GNU
      • Hera
      • Cheyenne
  • WCOSS2
    • Dogwood/Cactus
    • Acorn
  • CI
    • Completed
  • opnReqTest
    • N/A
    • Log attached to comment

@jkbk2004
Collaborator

jkbk2004 commented Feb 2, 2023

need to revert hera/gnu to 9.2 and cheyenne/gnu to 10.1

@zach1221
Collaborator Author

zach1221 commented Feb 2, 2023

@jkbk2004 I have reverted hera/gnu to 9.2 and cheyenne/gnu to 10.1.

@jkbk2004
Collaborator

jkbk2004 commented Feb 2, 2023

> @jkbk2004 I have reverted hera/gnu to 9.2 and cheyenne/gnu to 10.1.

@natalie-perlin @zach1221 can we confirm that these reverted versions are correct?

@jkbk2004
Collaborator

jkbk2004 commented Feb 2, 2023

cheyenne.gnu has some issues. Build crashes:

Error: Type mismatch between actual argument at (1) and actual argument at (2) (REAL(4)/REAL(8)).
/glade/scratch/jongkim/UFS-RT-tests/rt-1595-gnu/HYCOM-interface/HYCOM/mod_xc_mp.h:1196:25:

 1196 | call mpi_bcast(ga(1,j0_pe(mpe_1(np),np)+1), &
      | 1

@jkbk2004
Collaborator

jkbk2004 commented Feb 2, 2023

Orion needs updated fms/g2tmpl modules:
Lmod has detected the following error: The following module(s) are unknown: "fms/2022.04" "g2tmpl/1.10.2"
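
The Lmod message above names the modules that the current MODULEPATH cannot resolve. As a rough sketch (the function name `missing_modules` is made up for illustration), the quoted names can be pulled out of a saved error message so that each one can be checked against the stack's modulefile tree:

```shell
# Hypothetical helper: read an Lmod error message on stdin and print each
# quoted module name from the "module(s) are unknown" line, one per line.
missing_modules() {
  grep -F 'module(s) are unknown' | grep -o '"[^"]*"' | tr -d '"'
}

msg='Lmod has detected the following error: The following module(s) are unknown: "fms/2022.04" "g2tmpl/1.10.2"'
printf '%s\n' "$msg" | missing_modules
```

The two `grep` stages keep the match limited to the "unknown modules" line, so other quoted strings in a longer log are ignored.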

@jkbk2004
Collaborator

jkbk2004 commented Feb 2, 2023

On gaea, some issues as well:
Lmod has detected the following error: /lustre/f2/pdata/ncep/Jong.Kim/rt-1595/modulefiles/ufs_gaea.intel: (ufs_gaea.intel): invalid command name "export"
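
`invalid command name` is the message a Tcl interpreter prints for a command it does not recognize, which suggests that the Tcl modulefile contains a shell-style `export` line. A minimal sketch for locating such lines follows; the helper name and the sample modulefile contents are illustrative assumptions, not the actual ufs_gaea.intel file:

```shell
# Hypothetical helper: list lines in a modulefile that start with the shell
# builtin "export", which Tcl modulefiles do not understand.
find_shell_exports() {
  grep -nE '^[[:space:]]*export[[:space:]]' "$1"
}

# Demo on a throwaway Tcl-style modulefile (contents are illustrative only).
tmp="$(mktemp)"
cat > "$tmp" <<'EOF'
#%Module
module load cmake/3.20.1
export CC=cc
setenv FC ftn
EOF
find_shell_exports "$tmp"
```

In a Tcl modulefile the equivalent of the offending line would be a `setenv` command, as on the last line of the sample.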

@jkbk2004
Collaborator

jkbk2004 commented Feb 2, 2023

@natalie-perlin can you address the above issues?

@jkbk2004
Collaborator

jkbk2004 commented Feb 3, 2023

@natalie-perlin The build is not going through on Orion. Please take a look at: /work/noaa/epic-ps/jongkim/rt-1595/stmp/jongkim/FV3_RT/rt_1450/compile_006

@jkbk2004
Collaborator

jkbk2004 commented Feb 3, 2023

@natalie-perlin still same build trouble on gaea: /lustre/f2/scratch/Jong.Kim/FV3_RT/rt_7709/compile_001

@natalie-perlin
Collaborator

Updated fms/2022.04 and g2tmpl/1.10.2 on Orion

@jkbk2004
Collaborator

jkbk2004 commented Feb 3, 2023

> Updated fms/2022.04 and g2tmpl/1.10.2 on Orion

It's still not working on Orion.

+ cmake /work/noaa/epic-ps/jongkim/UFS-RT-tests/rt-1595 -DAPP=S2S -DCCPP_SUITES=FV3_GFS_v17_coupled_p8_sfcocn -DCMEPS_AOFLUX=ON -DMPI=ON -DCMAKE_BUILD_TYPE=Release -DMOM6SOLO=ON
CMake Error at /apps/cmake-3.22.1/share/cmake-3.22/Modules/CMakeDetermineCCompiler.cmake:49 (message):
  Could not find compiler set in environment variable CC:

@natalie-perlin
Collaborator

natalie-perlin commented Feb 3, 2023

On Orion: Cmake complains about the following env. variables not set:
(See /work/noaa/epic-ps/jongkim/rt-1595/stmp/jongkim/FV3_RT/rt_1450/compile_006/err)

CMake Error: CMAKE_C_COMPILER not set, after EnableLanguage
CMake Error: CMAKE_CXX_COMPILER not set, after EnableLanguage
CMake Error: CMAKE_Fortran_COMPILER not set, after EnableLanguage

Suggest to add the following to ./modulefiles/ufs_orion.intel.lua (after L29):

setenv("CMAKE_C_COMPILER", "mpiicc")
setenv("CMAKE_CXX_COMPILER", "mpiicpc")
setenv("CMAKE_Fortran_COMPILER", "mpiifort")

Similar env. variable settings to be added to ufs_orion.intel_debug.lua.
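
For reference, CMake takes its initial compilers from the `CC`, `CXX`, and `FC` environment variables (the error quoted earlier mentions `CC`), whereas `CMAKE_C_COMPILER` and its siblings are CMake variables that are normally passed with `-D`. Whether that distinction is the cause on Orion is not established here. A minimal pre-flight sketch, with a hypothetical function name, that checks the three environment variables before cmake is invoked:

```shell
# Hypothetical pre-flight check: each of CC, CXX, FC must be set and must
# name an executable that can be found on PATH.
check_compiler_env() {
  status=0
  for var in CC CXX FC; do
    eval "val=\${$var:-}"
    if [ -z "$val" ]; then
      echo "$var is not set"
      status=1
    elif ! command -v "$val" >/dev/null 2>&1; then
      echo "$var=$val not found on PATH"
      status=1
    fi
  done
  return "$status"
}

# Assumed Intel MPI wrapper names, taken from the suggestion above.
CC=mpiicc CXX=mpiicpc FC=mpiifort check_compiler_env \
  || echo "fix the module environment before running cmake"
```

Running the check right after the modules are loaded would show whether the modulefile actually exported usable compilers into the build shell.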

@natalie-perlin
Collaborator

natalie-perlin commented Feb 3, 2023

@jkbk2004, @zach1221 - regarding the issue on Gaea:
Please set the correct module environment for Gaea. Use the following EPIC-managed Lmod installed on Gaea in module-setup.sh, right after the line that reads elif [[ $MACHINE_ID = gaea* ]] ; then

      
    if ( ! eval module help > /dev/null 2>&1 ) ; then
      # Use EPIC-managed Lmod installed on Gaea
      export BASH_ENV="/lustre/f2/dev/role.epic/contrib/apps/lmod/lmod/init/bash"
      source $BASH_ENV
      export LMOD_SYSTEM_DEFAULT_MODULES="modules/3.2.11.4"
      module --initial_load --no_redirect restore
    fi

The following lines then need to be removed from compile.sh, as they likely overwrite the module environment initialization done in module-setup.sh:

    # Activate lua environment for gaea
    if [[ $MACHINE_ID == gaea.* ]] ; then
      source /lustre/f2/pdata/esrl/gsd/contrib/lua-5.1.4.9/init/init_lmod.sh
    fi
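
Note that the two tests quoted above use different patterns, `gaea*` and `gaea.*`. In a bash `[[ ... ]]` test an unquoted right-hand side is a glob, not a regular expression, so `gaea.*` requires a literal dot. The sketch below shows the same glob rules with a portable `case` statement; the helper name and the sample machine IDs are illustrative assumptions:

```shell
# Return 0 when the string in $1 matches the glob in $2.
# The pattern is deliberately left unquoted so that it is treated as a glob.
matches_glob() {
  case "$1" in
    $2) return 0 ;;
    *)  return 1 ;;
  esac
}

for id in gaea gaea.intel gaeaX; do
  for pat in 'gaea*' 'gaea.*'; do
    if matches_glob "$id" "$pat"; then
      echo "$id matches $pat"
    else
      echo "$id does not match $pat"
    fi
  done
done
```

Only IDs of the form `gaea.<something>` satisfy both patterns, so the two scripts agree as long as MACHINE_ID always carries a compiler suffix.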

@natalie-perlin
Collaborator

The module on Gaea needs to be converted to Lua format. This is the modulefile ufs_gaea.intel.lua:

help([[
  This module loads libraries required for building and running UFS Weather Model 
  on the NOAA RDHPC machine Gaea using Intel-2022.1.2
]])

whatis([===[Loads libraries needed for building the UFS Weather Model on Gaea ]===])

prepend_path("MODULEPATH", "/lustre/f2/dev/role.epic/contrib/modulefiles")
load(pathJoin("miniconda3",os.getenv("miniconda_ver") or "4.12.0"))

load(pathJoin("cmake", os.getenv("cmake_ver") or "3.20.1"))

prepend_path("MODULEPATH","/lustre/f2/dev/role.epic/contrib/hpc-stack/intel-2021.3.0/modulefiles/stack")
load(pathJoin("hpc", os.getenv("hpc_ver") or "1.2.0"))
load(pathJoin("intel", os.getenv("intel_ver") or "2021.3.0"))
load(pathJoin("hpc-intel", os.getenv("hpc_intel_ver") or "2021.3.0"))
load(pathJoin("hpc-cray-mpich", os.getenv("hpc_cray_mpich_ver") or "7.7.11"))
load(pathJoin("gcc", os.getenv("gcc_ver") or "8.3.0"))
load(pathJoin("libpng", os.getenv("libpng_ver") or "1.6.37"))

-- needed for WW3 build
load(pathJoin("gcc", os.getenv("gcc_ver") or "8.3.0"))
-- Needed at runtime:
load("alps")
load("rocoto")

load("ufs_common")

setenv("CC","cc")
setenv("FC","ftn")
setenv("CXX","CC")
setenv("CMAKE_C_COMPILER","cc")
setenv("CMAKE_CXX_COMPILER","CC")
setenv("CMAKE_Fortran_COMPILER","ftn")
setenv("CMAKE_Platform","gaea.intel")
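
Each `load(pathJoin(name, os.getenv("name_ver") or "default"))` line above lets an environment variable override the pinned version. In Lua the `or` fallback is taken only when the variable is unset (nil), which corresponds to the shell's `${var-default}` form rather than `${var:-default}`. A small shell sketch of the same resolution rule, using a hypothetical helper name and an illustrative override value:

```shell
# Hypothetical helper mirroring `os.getenv(var) or default` in the modulefile:
# print "<name>/<version>", using the default only if the variable is unset.
module_spec() {
  name=$1 var=$2 default=$3
  eval "ver=\${$var-\$default}"
  echo "$name/$ver"
}

# No override: the pinned default is used.
unset cmake_ver
module_spec cmake cmake_ver 3.20.1

# With an override in the environment, the override wins.
cmake_ver=3.22.1
module_spec cmake cmake_ver 3.20.1
```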

@natalie-perlin
Collaborator

The modulefile ufs_gaea.intel_debug.lua is similar to ufs_gaea.intel.lua, except for this line:
load("ufs_common_debug")

@natalie-perlin
Collaborator

@zach1221, @jkbk2004

> @jkbk2004 I have reverted hera/gnu to 9.2 and cheyenne/gnu to 10.1.

What were the questions or issues with the newer compiler versions on Hera and Cheyenne?

zach1221 added the jenkins-ci (Jenkins CI: ORT build/test on docker container) label Feb 3, 2023
@zach1221
Collaborator Author

zach1221 commented Feb 3, 2023

@natalie-perlin I think it was just to recheck/confirm that there were no issues with the below compilers on these two HPCs.
[image]

@zach1221
Collaborator Author

zach1221 commented Feb 3, 2023

@natalie-perlin I added the "setenv" variable for cmake compiler to the Orion module lua file but still failed with same "language not enabled error".

@jkbk2004
Collaborator

jkbk2004 commented Feb 3, 2023

> On Orion: Cmake complains about the following env. variables not set: (See /work/noaa/epic-ps/jongkim/rt-1595/stmp/jongkim/FV3_RT/rt_1450/compile_006/err)
>
> CMake Error: CMAKE_C_COMPILER not set, after EnableLanguage
> CMake Error: CMAKE_CXX_COMPILER not set, after EnableLanguage
> CMake Error: CMAKE_Fortran_COMPILER not set, after EnableLanguage
>
> Suggest to add the following to ./modulefiles/ufs_orion.intel.lua (after L29):
>
> setenv("CMAKE_C_COMPILER", "mpiicc")
> setenv("CMAKE_CXX_COMPILER", "mpiicpc")
> setenv("CMAKE_Fortran_COMPILER", "mpiifort")
>
> Similar env. variable settings to be added to ufs_orion.intel_debug.lua.

@natalie-perlin I tried, but it is still not working.

@jkbk2004
Collaborator

jkbk2004 commented Feb 3, 2023

> The module on Gaea needs to be converted to lua format. This is the modulefile ufs_gaea.intel.lua :
>
> help([[
>   This module loads libraries required for building and running UFS Weather Model
>   on the NOAA RDHPC machine Gaea using Intel-2022.1.2
> ]])
>
> whatis([===[Loads libraries needed for building the UFS Weather Model on Gaea ]===])
>
> prepend_path("MODULEPATH", "/lustre/f2/dev/role.epic/contrib/modulefiles")
> load(pathJoin("miniconda3",os.getenv("miniconda_ver") or "4.12.0"))
>
> load(pathJoin("cmake", os.getenv("cmake_ver") or "3.20.1"))
>
> prepend_path("MODULEPATH","/lustre/f2/dev/role.epic/contrib/hpc-stack/intel-2021.3.0/modulefiles/stack")
> load(pathJoin("hpc", os.getenv("hpc_ver") or "1.2.0"))
> load(pathJoin("intel", os.getenv("intel_ver") or "2021.3.0"))
> load(pathJoin("hpc-intel", os.getenv("hpc_intel_ver") or "2021.3.0"))
> load(pathJoin("hpc-cray-mpich", os.getenv("hpc_cray_mpich_ver") or "7.7.11"))
> load(pathJoin("gcc", os.getenv("gcc_ver") or "8.3.0"))
> load(pathJoin("libpng", os.getenv("libpng_ver") or "1.6.37"))
>
> -- needed for WW3 build
> load(pathJoin("gcc", os.getenv("gcc_ver") or "8.3.0"))
> -- Needed at runtime:
> load("alps")
> load("rocoto")
>
> load("ufs_common")
>
> setenv("CC","cc")
> setenv("FC","ftn")
> setenv("CXX","CC")
> setenv("CMAKE_C_COMPILER","cc")
> setenv("CMAKE_CXX_COMPILER","CC")
> setenv("CMAKE_Fortran_COMPILER","ftn")
> setenv("CMAKE_Platform","gaea.intel")

This should be a separate feature. We are not using Lua on Gaea yet. We need to confirm that your HPC-Stack installation works first.

@natalie-perlin
Collaborator

> > On Orion: Cmake complains about the following env. variables not set: (See /work/noaa/epic-ps/jongkim/rt-1595/stmp/jongkim/FV3_RT/rt_1450/compile_006/err)
> >
> > CMake Error: CMAKE_C_COMPILER not set, after EnableLanguage
> > CMake Error: CMAKE_CXX_COMPILER not set, after EnableLanguage
> > CMake Error: CMAKE_Fortran_COMPILER not set, after EnableLanguage
> >
> > Suggest to add the following to ./modulefiles/ufs_orion.intel.lua (after L29):
> >
> > setenv("CMAKE_C_COMPILER", "mpiicc")
> > setenv("CMAKE_CXX_COMPILER", "mpiicpc")
> > setenv("CMAKE_Fortran_COMPILER", "mpiifort")
> >
> > Similar env. variable settings to be added to ufs_orion.intel_debug.lua.
>
> @natalie-perlin I tried, but it is still not working.

@jkbk2004, regarding Orion: where is the directory with the WM code that is being built?

@natalie-perlin
Collaborator

> @natalie-perlin I added the "setenv" variable for cmake compiler to the Orion module lua file but still failed with same "language not enabled error".

Let me know where the modulefile is, and where the log with the error is.

@jkbk2004
Collaborator

jkbk2004 commented Feb 3, 2023

@natalie-perlin can you read the crashing errors and paths above?

@natalie-perlin
Collaborator

natalie-perlin commented Feb 3, 2023

> @natalie-perlin can you read the crashing errors and paths above?

@jkbk2004 @zach1221 - Not finding the log file with the "language not enabled error" message on Orion

@jkbk2004
Collaborator

jkbk2004 commented Feb 3, 2023

Please take a look at /work/noaa/epic-ps/jongkim/rt-1595/stmp/jongkim/FV3_RT/rt_1450/compile_006/err

@natalie-perlin
Collaborator

> @natalie-perlin I think it was just to recheck/confirm that there were no issues with the below compilers on these two HPCs.
> [image]

Cheyenne gnu/10.1.0 is verified.

Hera gnu/9.2 in /scratch1/NCEPDEV/nems/role.epic/hpc-stack/libs/gnu-9.2/ is verified as well (with mpich/3.3.2)

@jkbk2004
Collaborator

jkbk2004 commented Feb 3, 2023

Cheyenne.gnu passes ok!

@natalie-perlin
Collaborator

natalie-perlin commented Feb 3, 2023

> Please take a look at /work/noaa/epic-ps/jongkim/rt-1595/stmp/jongkim/FV3_RT/rt_1450/compile_006/err

This log file with the error has a timestamp from earlier today, before the modulefile ufs_orion.intel.lua was changed. Have any tests been run since the modulefile update?

UPD: ... Found new test locations, looking at them now
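
A quick way to tell whether an error log predates a modulefile edit is the shell's `-nt` (newer than) file test, which bash and most other shells support. The sketch below uses temporary stand-in files, made-up timestamps, and a hypothetical function name:

```shell
# Hypothetical check: succeed if the log ($1) is older than the modulefile
# ($2), i.e. the log cannot reflect the latest modulefile edit.
log_is_stale() {
  [ "$2" -nt "$1" ]
}

# Stand-in files with explicit timestamps (YYYYMMDDhhmm), for illustration.
dir="$(mktemp -d)"
touch -t 202302030900 "$dir/err"
touch -t 202302031400 "$dir/ufs_orion.intel.lua"

if log_is_stale "$dir/err" "$dir/ufs_orion.intel.lua"; then
  echo "log predates the modulefile change; rerun the build"
fi
```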

@zach1221
Collaborator Author

zach1221 commented Feb 9, 2023

@jkbk2004 gaea modulefiles are reverted.

zach1221 linked an issue Feb 9, 2023 that may be closed by this pull request
@zach1221
Collaborator Author

zach1221 commented Feb 9, 2023

zach1221 marked this pull request as ready for review February 9, 2023 16:04
jkbk2004 added the Waiting for Reviews label (The PR is waiting for reviews from associated component PR's.) and removed the jenkins-ci label (Jenkins CI: ORT build/test on docker container) Feb 9, 2023
jkbk2004 merged commit 0c8e74c into ufs-community:develop Feb 9, 2023
@natalie-perlin
Collaborator

natalie-perlin commented Feb 9, 2023

@jkbk2004 - Gaea transition has not been done yet.
