Question: Is AWS Pcluster supported by SCHISM? #88

Closed
zeekus opened this issue Sep 27, 2022 · 6 comments

zeekus commented Sep 27, 2022

Has anyone gotten SCHISM working on AWS pcluster 3.2?

I created an 8-node cluster on AWS with 256 cores and 512 GB of RAM, but the run fails with segfaults.

SlurmQueues:
  - Name: compute
    ComputeResources:
      - Name: slurmworkers
        InstanceType: c4.8xlarge
        MinCount: 0
        MaxCount: 8

os: centos7
modules loaded: hdf5-1.12.2-gcc-4.8.5-omqotpp openmpi-4.1.4-gcc-4.8.5-23hmmfu netcdf-fortran-4.5.4-gcc-4.8.5-y6iccqw netcdf-c-4.8.1-gcc-4.8.5-2eml4r3
compiled: GNU Fortran (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44)
Binary: -> /modeling/pschism/icm_Balg/build/bin/pschism_ICM_ANALYSIS_PREC_EVAP_TVD-VL -version

schism v5.9.0mod
git hash 2e289ae (20 commits since semantic tag, edits=True)

My mpirun call (I am trying to tell MPI to run 32 processes on each node; is this syntax correct?):
/opt/parallelcluster/shared/spack/opt/spack/linux-centos7-haswell/gcc-4.8.5/openmpi-4.1.4-23hmmfud3rw4njh3m5ilmukatjrgn4i2/bin/mpirun --hostfile hostnames.txt -n 32 --map-by node /modeling/pschism/icm_Balg/build/bin/pschism_ICM_ANALYSIS_PREC_EVAP_TVD-VL
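
On the syntax question: in Open MPI, `-n` sets the total number of ranks for the job, so `-n 32 --map-by node` starts 32 ranks in all and deals them out round-robin (about 4 per node on 8 nodes) rather than 32 on each node. A minimal sketch of one way to ask for 32 ranks per node follows, assuming Open MPI 4.x and that hostnames.txt lists 8 hosts. The variable names are illustrative, and `mpirun` stands in for the full spack path from the post. The script only prints the command so it can be checked before it is run.

```shell
#!/bin/bash
# Sketch: build an Open MPI command line that places a fixed number of ranks on every node.
NODES=8                  # hosts in hostnames.txt (assumed from the 8-node cluster in the post)
RANKS_PER_NODE=32        # ranks wanted on each node (from the post)
TOTAL_RANKS=$((NODES * RANKS_PER_NODE))

# Shortened for readability; substitute the full mpirun path from the post if needed.
MPIRUN=mpirun
BIN=/modeling/pschism/icm_Balg/build/bin/pschism_ICM_ANALYSIS_PREC_EVAP_TVD-VL

# -n is the total rank count; ppr:N:node means "N processes per resource", here per node.
CMD="$MPIRUN --hostfile hostnames.txt -n $TOTAL_RANKS --map-by ppr:${RANKS_PER_NODE}:node $BIN"
echo "$CMD"
```

Depending on how many slots Open MPI detects on each host, the hostfile entries may also need `slots=32`, otherwise the launch may stop with a not-enough-slots error.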

job out says:
7 total processes killed (some possibly by mpirun during cleanup)
did we get an error 139

The error file says invalid memory references.

[centos@ip-10-137-0-172 tune44h]$ cat job.78.err
Currently Loaded Modulefiles:

  1. netcdf-c-4.8.1-gcc-4.8.5-2eml4r3
  2. netcdf-fortran-4.5.4-gcc-4.8.5-y6iccqw
  3. hdf5-1.12.2-gcc-4.8.5-omqotpp
  4. openmpi-4.1.4-gcc-4.8.5-23hmmfu
Warning: Permanently added 'compute-dy-slurmworkers-4,10.137.0.181' (ECDSA) to the list of known hosts.
Warning: Permanently added 'compute-dy-slurmworkers-7,10.137.0.179' (ECDSA) to the list of known hosts.
Warning: Permanently added 'compute-dy-slurmworkers-3,10.137.0.137' (ECDSA) to the list of known hosts.
Warning: Permanently added 'compute-dy-slurmworkers-6,10.137.0.161' (ECDSA) to the list of known hosts.
Warning: Permanently added 'compute-dy-slurmworkers-5,10.137.0.133' (ECDSA) to the list of known hosts.
Warning: Permanently added 'compute-dy-slurmworkers-2,10.137.0.143' (ECDSA) to the list of known hosts.
Warning: Permanently added 'compute-dy-slurmworkers-8,10.137.0.182' (ECDSA) to the list of known hosts.

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
#0 0x7F27901B96D7
#1 0x7F27901B9D1E
#2 0x7F278F4983FF
#3 0x53C5D0 in calkwq_
#4 0x554151 in ecosystem_
#5 0x45F1EC in schism_step_
#6 0x404C4B in schism_main_

Author

zeekus commented Sep 27, 2022

I get a similar error when I run the binary directly on our controller node, without mpiexec.

[centos@ip-10-137-0-172 tune44h]$ /modeling/pschism/icm_Balg/build/bin/pschism_ICM_ANALYSIS_PREC_EVAP_TVD-VL

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
#0 0x2B815BBEC6D7
#1 0x2B815BBECD1E
#2 0x2B815C89B3FF
#3 0x53C5D0 in calkwq_
#4 0x554151 in ecosystem_
#5 0x45F1EC in schism_step_
#6 0x404C4B in schism_main_
#7 0x404CB6 in MAIN__ at schism_driver.F90:?
Segmentation fault (core dumped)
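
The backtrace above names the routines but gives no source lines. One possible way to narrow it down is to extract the frame addresses and hand them to `addr2line` from binutils. This is only a sketch: the sample text is copied from the backtrace above, the variable names are illustrative, and file and line numbers will only appear if the binary was built with debug information (`-g`).

```shell
#!/bin/bash
# Sketch: pull the SCHISM frame addresses out of a gfortran backtrace so they can be resolved.
# The sample text is copied from the post; on a real run, read the job's .err file instead.
BACKTRACE='#3 0x53C5D0 in calkwq_
#4 0x554151 in ecosystem_
#5 0x45F1EC in schism_step_
#6 0x404C4B in schism_main_'

# Keep the hexadecimal address (second field) of every frame that names a routine.
ADDRS=$(printf '%s\n' "$BACKTRACE" | awk '$3 == "in" {print $2}' | tr '\n' ' ')
ADDRS=${ADDRS% }
echo "addresses: $ADDRS"

BIN=/modeling/pschism/icm_Balg/build/bin/pschism_ICM_ANALYSIS_PREC_EVAP_TVD-VL
if [ -x "$BIN" ] && command -v addr2line >/dev/null 2>&1; then
    # -f prints the function name, -e selects the executable to inspect.
    addr2line -f -e "$BIN" $ADDRS
else
    echo "binary or addr2line not available; skipping lookup"
fi
```

Since the same frames show up with and without MPI, the fault looks tied to `calkwq_` itself rather than to the cluster launch.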

Author

zeekus commented Sep 29, 2022

It appears pschism works on AWS pcluster when the Intel Fortran compiler (ifort) and the Intel MPI libraries are used. I suspect this code may not be compatible with gfortran. Has anyone successfully run a pschism build made with gfortran? To get things to compile on our end I had to modify two files. There may be other gfortran compatibility issues I am missing.

Files and lines modified:

  1. File: ./icm_Balg/src/Hydro/schism_init.F90:5451.114
  2. File: ./icm_Balg/src/ICM/icm_sed_flux.F90

Ref: ./icm_Balg/src/Hydro/schism_init.F90:5451.114
Summary: array constructor issue. gfortran requires every string in the array constructor to have the same length.

','CPOC ','tlfveg ','tstveg ','trtveg ','hcanveg','lfsav ','stsav ','rts

Error: Different CHARACTER lengths (7/6) in array constructor at (1)
make[3]: *** [Hydro/CMakeFiles/hydro.dir/schism_init.F90.o] Error 1
make[2]: *** [Hydro/CMakeFiles/hydro.dir/all] Error 2
make[1]: *** [Driver/CMakeFiles/pschism.dir/rule] Error 2
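
Besides padding each literal by hand, a Fortran 2003 array constructor can carry a type-spec that fixes the element length. The sketch below writes a tiny standalone program to show the idea and compiles it only if gfortran is on the PATH. The file `demo_names.F90`, its program name and the three-element list are hypothetical stand-ins, not SCHISM source.

```shell
#!/bin/bash
# Sketch: show a constructor with an explicit type-spec, which allows literals of different lengths.
# demo_names.F90 is a hypothetical test file; it is not part of SCHISM.
cat > demo_names.F90 <<'EOF'
program demo_names
  implicit none
  character(len=7) :: names(3)
  ! The type-spec pads every element to 7 characters, so the literals may differ in length.
  names = [character(len=7) :: 'CPOC', 'tlfveg', 'hcanveg']
  print '(3("[",a,"]"))', names
end program demo_names
EOF

if command -v gfortran >/dev/null 2>&1; then
    gfortran demo_names.F90 -o demo_names && ./demo_names
else
    echo "gfortran not found; demo_names.F90 written but not compiled"
fi
```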

Ref: ./icm_Balg/src/ICM/read_icm_input.F90:326:
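Summary: the preprocessor seems to treat '/*' inside a Fortran comment as the start of a C-style comment (reported as an error at icm_sed_flux.F90:1385). Line 326 of read_icm_input.F90 only triggers a warning.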

Output:
/modeling/pschism/icm_Balg/src/ICM/read_icm_input.F90:326:0: warning: extra tokens at end of #endif directive [enabled by default]
#endif ICM_PH
^
/modeling/pschism/icm_Balg/src/ICM/icm_sed_flux.F90:1385:0: error: unterminated comment
!with all state variables in unit of g/*, no need to transfer
^
Error: Unexpected end of file in '/modeling/pschism/icm_Balg/src/ICM/icm_sed_flux.F90'
make[3]: *** [ICM/CMakeFiles/icm.dir/icm_sed_flux.F90.o] Error 1
make[3]: *** Waiting for unfinished jobs....
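
To find other lines that could trip the preprocessor the same way, the source tree can be searched for a literal `/*`. A small sketch, using a throwaway directory and a hypothetical `sample_icm.F90` so that it runs anywhere; in practice `SRC_DIR` would point at the real source directory.

```shell
#!/bin/bash
# Sketch: list Fortran lines containing "/*", which a C preprocessor can read as a comment opener.
# sample_icm.F90 is a hypothetical stand-in; point SRC_DIR at the real source tree instead.
SRC_DIR=$(mktemp -d)
printf '%s\n' \
  '!with all state variables in unit of g/*, no need to transfer' \
  'x = a/b' > "$SRC_DIR/sample_icm.F90"

# -r recurses, -n prints line numbers, and \* matches the asterisk literally.
HITS=$(grep -rn --include='*.F90' '/\*' "$SRC_DIR")
echo "$HITS"

rm -rf "$SRC_DIR"
```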

Member

josephzhang8 commented Sep 29, 2022 via email

Member

josephzhang8 commented Sep 29, 2022 via email

Author

zeekus commented Oct 3, 2022

Thanks. It seems I was using the wrong version.

#version pulled
[centos@ip-10-137-0-172 tune44h]$ /modeling/pschism/icm_Balg/build/bin/pschism_ICM_ANALYSIS_PREC_EVAP_TVD-VL -version

schism v5.9.0mod
git hash 2e289ae (20 commits since semantic tag, edits=False)

#version 5.10
[centos@ip-10-137-0-172 tune44h]$ /modeling/pschism/icm_Balg_v5.10/build/bin/pschism_ICM_ANALYSIS_PREC_EVAP_TVD-VL -version

schism develop
git hash aaa98b3

The proper version seems to be running.

[centos@ip-10-137-0-172 tune44h]$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
248 compute pschism_ centos R 19:13 8 compute-dy-slurmworkers-[1-8]

ifort (IFORT) 2021.6.0 20220226
Copyright (C) 1985-2022 Intel Corporation. All rights reserved.

Author

zeekus commented Sep 20, 2024

I am closing this ticket. I have recently been able to compile SCHISM with gcc-8.5.0 using OpenMPI4, gcc-12.0.2 using OpenMPI5, intel@2021.10.0 with intel-oneapi-mpi@2021.12.1, and intel@2021.6.0 with intel-oneapi-mpi@2021.9.0. It seems that either my GCC environment was not set up correctly or I had some errors in my cmake file.
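
For anyone who lands here with a similar environment question, a quick check of which compiler and MPI wrapper are first on the PATH can rule out mixed toolchains before rebuilding. This is a generic sketch, not the exact procedure used above; the tool list and the `report` helper are illustrative.

```shell
#!/bin/bash
# Sketch: report which Fortran compiler, MPI wrapper and cmake are first on PATH.
report() {
    # $1 is a command name; print its location and the first line of its version output.
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1 -> $(command -v "$1")"
        "$1" --version 2>/dev/null | head -n 1
    else
        echo "$1 -> not found"
    fi
}

for tool in gfortran ifort mpif90 mpirun cmake; do
    report "$tool"
done
```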

zeekus closed this as completed Sep 20, 2024