
cpld_control_p8 fails in same spot on Gaea C5 & Hercules w/ spack-stack #1791

Closed · ulmononian opened this issue Jun 7, 2023 · 16 comments
Labels: bug (Something isn't working), EPIC Support Requested

Comments

ulmononian (Collaborator) commented Jun 7, 2023

Description

strangely, cpld_control_p8 fails during the model run step in the same place on gaea c5 and hercules when using spack-stack/1.4.0, i.e. here:

180:        Wave model ...
180:  WW3 log written to /work2/noaa/epic-ps/cbook/HERCULES/add_hercules/rt_403736/cp
180:  ld_control_p8_intel/./log.ww3
  0:  Starting pFIO input server on Clients
  0:  Starting pFIO output server on Clients
  0:  Character Resource Parameter: ROOT_CF:AERO.rc
  0:  Character Resource Parameter: ROOT_NAME:AERO
  0:  Character Resource Parameter: HIST_CF:AERO_HISTORY.rc
  0:  Character Resource Parameter: EXTDATA_CF:AERO_ExtData.rc
  0:  DU::SetServices: Dust emission scheme is fengsha
  0:  WARNING: falling back on MAPL NUM_BANDS
  0:  GOCART2G::Initialize: Starting...
  0:
  0:  Integer*4 Resource Parameter: RUN_DT:720
  0:  ===================>
  0:  MAPL_StateCreateFromSpecNew: var SU_NO3 already exists. Skipping ...
  0:  ===================>
  0:  MAPL_StateCreateFromSpecNew: var SU_OH already exists. Skipping ...
  0:  ===================>
  0:  MAPL_StateCreateFromSpecNew: var SU_H2O2 already exists. Skipping ...
  0:   oserver is not split
  0:
  0:  EXPSRC:GEOSgcm-v10.16.0
  0:  EXPID: gocart
  0:  Descr: GOCART2g_diagnostics_at_c360
  0:  DisableSubVmChecks: F
  0:
  0:  Reading HISTORY RC Files:
  0:  -------------------------
  0:  NOT using buffer I/O for file: AERO_HISTORY.rc
  0:  NOT using buffer I/O for file: inst_aod.rcx
  0:
  0:  Freq: 00060000  Dur: 00010000  TM:   -1  Collection: inst_aod

to me, the only meaningful line in the err file is "108: fv3.exe 0000000001D9F59B aerosol_cap_mp_mo 348 Aerosol_Cap.F90", plus some mentions of libmpi.so and libc.so toward the very end.

rundirs are (hercules) /work2/noaa/epic-ps/cbook/HERCULES/add_hercules/rt_403736/cpld_control_p8_intel and (gaea c5) /lustre/f2/dev/role.epic/sandbox/cam_tests/test_c5/rt_232052/cpld_control_p8_intel.

the mapl and esmf versions on each machine are the same (mapl 2.35.2 and esmf 8.4.2). the specific module env on c5 is:

  1) craype-x86-rome                        26) netcdf-c/4.9.2
  2) perftools-base/23.03.0                 27) netcdf-fortran/4.6.0
  3) xpmem/2.5.2-2.4_3.45__gd0f7936.shasta  28) parallel-netcdf/1.12.2
  4) cray-pmi/6.1.10                        29) parallelio/2.5.9
  5) CmrsEnv/default                        30) esmf/8.4.2
  6) TimeZoneEDT/default                    31) fms/2023.01
  7) DefApps/default                        32) bacio/2.4.1
  8) cray-dsmml/0.2.2                       33) cmake/3.23.1
  9) PrgEnv-intel/8.3.3                     34) crtm-fix/2.4.0_emc
 10) intel-classic/2022.2.1                 35) git-lfs/2.11.0
 11) craype/2.7.20                          36) crtm/2.4.0
 12) stack-intel/2022.2.1                   37) g2/3.4.5
 13) craype-network-ofi                     38) g2tmpl/1.10.2
 14) cray-mpich/8.1.25                      39) ip/3.3.3
 15) stack-cray-mpich/8.1.25                40) sp/2.3.3
 16) python/3.9.12                          41) w3emc/2.9.2
 17) stack-python/3.9.12                    42) gftl/1.8.3
 18) libjpeg/2.1.0                          43) gftl-shared/1.5.0
 19) jasper/2.0.32                          44) ecbuild/3.7.2
 20) zlib/1.2.13                            45) yafyaml/0.5.1
 21) libpng/1.6.37                          46) mapl/2.35.2-esmf-8.4.2
 22) pkg-config/0.29.2                      47) scotch/7.0.3
 23) hdf5/1.14.0                            48) ufs_common
 24) curl/7.66.0                            49) modules.fv3
 25) zstd/1.5.2                             50) nccmp/1.9.0.1

and on hercules:

  1) intel-oneapi-compilers/2022.2.1  19) fms/2023.01
  2) stack-intel/2021.7.1             20) bacio/2.4.1
  3) intel-oneapi-mpi/2021.7.1        21) crtm-fix/2.4.0_emc
  4) stack-intel-oneapi-mpi/2021.7.1  22) git-lfs/3.1.2
  5) stack-python/3.9.14              23) crtm/2.4.0
  6) cmake/3.23.1                     24) g2/3.4.5
  7) libjpeg/2.1.0                    25) g2tmpl/1.10.2
  8) jasper/2.0.32                    26) ip/3.3.3
  9) zlib/1.2.13                      27) sp/2.3.3
 10) libpng/1.6.37                    28) w3emc/2.9.2
 11) hdf5/1.14.0                      29) gftl/1.8.3
 12) curl/8.0.1                       30) gftl-shared/1.5.0
 13) zstd/1.5.2                       31) ecbuild/3.7.2
 14) netcdf-c/4.9.2                   32) yafyaml/0.5.1
 15) netcdf-fortran/4.6.0             33) mapl/2.35.2-esmf-8.4.2
 16) parallel-netcdf/1.12.2           34) scotch/7.0.3
 17) parallelio/2.5.9                 35) ufs_common
 18) esmf/8.4.2                       36) modules.fv3

cpld_control_noaero_p8 works on both machines.

i tried updating the gocart hash and parm/gocart files that @junwang-noaa updated in #1745, but this did not resolve the issue.

To Reproduce:

on gaea:

git clone --recursive -b feature/add_c5 https://github.com/ulmononian/ufs-weather-model.git
cd ufs-weather-model/tests
./rt.sh -a <whichever_account_works> -c -n cpld_control_p8 intel
vim <rt_dir>/out

for hercules, just change the branch to -b feature/add_hercules
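
for reference, the hercules steps would look like this (same as above, only the branch changes; the account placeholder is whatever works for you):

git clone --recursive -b feature/add_hercules https://github.com/ulmononian/ufs-weather-model.git
cd ufs-weather-model/tests
./rt.sh -a <whichever_account_works> -c -n cpld_control_p8 intel
vim <rt_dir>/out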

Additional context

Needs to be fixed before #1707, #1733, or #1784 can be merged.

ulmononian added the bug (Something isn't working) label on Jun 7, 2023
ulmononian (Collaborator, Author) commented Jun 7, 2023

@mathomp4 fyi

ulmononian changed the title from "cpld_control_p8 fails in same spot on Gaea C5 and Hercules" to "cpld_control_p8 fails in same spot on Gaea C5 & Hercules w/ spack-stack" on Jun 7, 2023
mathomp4 commented Jun 7, 2023

@ulmononian Do you have the full log file available?

ulmononian (Collaborator, Author) commented:
@mathomp4 yes. i've attached both the err and out logs from the hercules run.

hercules_err.txt
hercules_out.txt

mathomp4 commented Jun 7, 2023

@ulmononian Can you compile and run with debugging flags on? Because that is not the most elucidating traceback... Or maybe compile MAPL with debugging?

I have to imagine there is more to that traceback.

ulmononian (Collaborator, Author) commented Jun 7, 2023

@mathomp4 i can quickly do cpld_debug_p8 with -DDEBUG=ON, but we do not install debug versions of mapl/esmf w/ spack-stack anymore (and they were removed from the WM), so i would need to build those later if this first-pass debug option doesn't produce anything useful. thanks for taking the time :) !
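
for reference, the debug run i have in mind is just the debug variant of the same rt.sh invocation as in the reproduction steps above (a sketch; the account name is a placeholder):

cd ufs-weather-model/tests
./rt.sh -a <whichever_account_works> -c -n cpld_debug_p8 intel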

ulmononian (Collaborator, Author) commented:
@mathomp4 sorry for the delay on this. i did run cpld_debug_p8 but the logs were no better. i will need to build the debug versions of esmf and mapl. i'll let you know once i do that and re-run the tests!

ulmononian (Collaborator, Author) commented:
hi @mathomp4, i re-ran the model w/ debug esmf/mapl. here are those logs:
err.txt

out.txt

thanks again for looking at this!!!

mathomp4 commented:
Huh. You are dying in History at:

https://github.com/GEOS-ESM/MAPL/blob/c263c940d2961adddfa859ff233e353dc6a6c34b/gridcomps/History/MAPL_HistoryGridComp.F90#L2336

which is some pretty old code.

Can you share the HISTORY.rc you are using? I might need to call in @bena-nasa (our History expert) to see if maybe the file is oddly set up.

ulmononian (Collaborator, Author) commented Jun 14, 2023

> Huh. You are dying in History at:
>
> https://github.com/GEOS-ESM/MAPL/blob/c263c940d2961adddfa859ff233e353dc6a6c34b/gridcomps/History/MAPL_HistoryGridComp.F90#L2336
>
> which is some pretty old code.
>
> Can you share the HISTORY.rc you are using? I might need to call in @bena-nasa (our History expert) to see if maybe the file is oddly set up.

circling back here: @mathomp4 and i discussed (offline) that the history file in question is https://github.com/ufs-community/ufs-weather-model/blob/develop/tests/parm/gocart/AERO_HISTORY.rc.IN. it appears that HISTORY.rc is renamed in CAP.rc here:

HIST_CF: AERO_HISTORY.rc

@mathomp4 is looking into this.
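
for context, a quick way to confirm which history file the MAPL cap reads (a sketch only; the rundir path is a placeholder, and this assumes CAP.rc sits in the run directory):

grep HIST_CF <rundir>/CAP.rc
# expected output, per the above: HIST_CF: AERO_HISTORY.rc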

bena-nasa commented:
That is quite bizarre. Your History file looks fine, you are not doing anything fancy; there's just nothing I see there that could explain this error. It's almost like this is some sort of compiler bug/memory bug. You are failing on a call that is simply passing an optional string to a procedure, and it's like the memory of the string being passed there is fubarred.
What EXACTLY is different here vs. a run that worked? Is it just the compiler/library stack, for example, or did you change source code, input files, etc.?

ulmononian (Collaborator, Author) commented Jun 14, 2023

thanks for looking into this @bena-nasa! so far, we have not been able to run the coupled model (S2SWA - ocean + atm w/ waves and aerosols enabled) on hercules or c5. these are machines we are just beginning to use, so we are trying to get the weather model working on each.

as for what is different between these failing runs on hercules/c5 and successful runs: it is only the stack. we use a common set of libraries on each machine (so the hercules/c5 library versions are identical to those on the other machines where the model works), but the compiler/mpi versions differ between machines. there is nothing different in the model setup/configuration or input data. previously, we have only run into compiler/mpi version issues on NCAR cheyenne, where the gnu/openmpi combination simply does not work with the aerosol model.

it could easily be that we are seeing some unknown compiler/mpi issues on hercules/c5 when the aerosol model is turned on. but given that the model fails in the same spot on both machines even though the compiler/mpi pairs differ between them, i am more inclined to believe it could be a memory issue, as you suggested.

i will look at some node adjustments/stack size settings and see if we can make any progress there.
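
for reference, the kind of settings i mean (a rough sketch only; the exact mechanism differs per machine/job script, and these values are placeholders, not tested fixes):

# in the shell / job script that launches fv3.exe
ulimit -s unlimited          # lift the process stack limit
export OMP_STACKSIZE=512M    # per-thread stack size for OpenMP regions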

thanks so much!

ulmononian (Collaborator, Author) commented:
@mathomp4 following up on the compiler versions on hercules and gaea c5:

hercules: intel-oneapi-compilers/2022.2.1 w/ intel-oneapi-mpi/2021.7.1
gaea c5: intel-classic/2022.2.1 w/ cray-mpich/8.1.25

i can get more info about these if it is helpful. please correct me if i misunderstood you, but were you saying that GEOS/MAPL has shown issues running w/ intel compiler versions newer than 2021.9.1?

mathomp4 commented:
@ulmononian I think so, yes, but I need to refresh my memory. At the moment, our Intel version of choice is Intel Classic ifort 2021.6.0. I do see we have intel 2021.7 on our cluster, so I can try that out and see, maybe it shows the issue...

The Intel MPI version you use on hercules is good. As for cray-mpich, 🤷🏼 , but if it worked before, probably will now.

ulmononian (Collaborator, Author) commented:

> @ulmononian I think so, yes, but I need to refresh my memory. At the moment, our Intel version of choice is Intel Classic ifort 2021.6.0. I do see we have intel 2021.7 on our cluster, so I can try that out and see, maybe it shows the issue...
>
> The Intel MPI version you use on hercules is good. As for cray-mpich, 🤷🏼 , but if it worked before, probably will now.

unfortunately, we do not have access to compilers older than those listed in my previous comment. do you know if support for the newer intel compilers is planned, and if so, if there's a timeline for that support? perhaps this should be tracked as an issue on the mapl github...

ulmononian (Collaborator, Author) commented:

this issue was resolved by upgrading the intel compiler version to intel/2023.1.0 so that ifort/2021.7.1 is not used (see JCSDA/spack-stack#675 and JCSDA/spack-stack#673).
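
for anyone verifying this later, a quick sanity check on which fortran compiler actually backs the loaded module (a sketch; the module name is an assumption based on the module lists above):

module load stack-intel/2023.1.0   # or the machine's intel/2023.1.0 module
ifort --version                    # should report an ifort newer than 2021.7.1 (e.g. 2021.9.x)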

natalie-perlin (Collaborator) commented:
@jkbk2004 the issue is resolved and could be closed!
