NWS = -1 crash with GNU compiler #73

Open
SorooshMani-NOAA opened this issue Jun 14, 2022 · 25 comments

Comments

@SorooshMani-NOAA

Related to noaa-ocs-modeling/PaHM#8

In my workflow, I run SCHISM compiled inside a Docker image built with GNU compilers on Ubuntu. After a recent rebuild, I noticed a crash when nws=-1. The crash does not happen if I set nws=0 with the same input files. When I compile with intel/2021.3.0, I also don't see the crash for the exact same input files.

This setup (Docker + GNU) used to work for me. I suspect that an update to the GNU compilers or to one of the libraries since the last time I built the image is behind this issue. (I don't pin the compiler or library versions in that Docker image; I just install them using apt.)

The details of the traceback can be found in the issue referenced at the top.
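
Because the compiler and library versions in the image are not pinned, it can help to record them at build time so that a working image and a failing one can be compared later. The following is a minimal sketch, not part of the original workflow: the function name `record_versions` is hypothetical, and the tool list (gfortran, mpifort, nc-config, nf-config) is an assumption about what the image contains.

```shell
# Print "tool: <first line of its --version output>" for each tool given,
# or "tool: not installed" when the tool is not on PATH.
# record_versions is a hypothetical helper, not part of SCHISM or PaHM.
record_versions() {
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            printf '%s: %s\n' "$tool" "$("$tool" --version 2>&1 | head -n 1)"
        else
            printf '%s: not installed\n' "$tool"
        fi
    done
}

# Assumed tool list; adjust to whatever the image actually installs.
record_versions gfortran mpifort nc-config nf-config
```

Saving this output alongside each image build would make it easier to tell which package changed between two builds.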

@josephzhang8
Member

josephzhang8 commented Jun 17, 2022 via email

@SorooshMani-NOAA
Author

I tried with Takis's Docker image. It gets past the parwind crash, but then it errors out later in scribe_io.F90:

At line 208 of file /app/schism/src/Core/scribe_io.F90
Fortran runtime error: Attempting to allocate already allocated variable 'iwork3'

Error termination. Backtrace:
At line 208 of file /app/schism/src/Core/scribe_io.F90
Fortran runtime error: Attempting to allocate already allocated variable 'iwork3'

Error termination. Backtrace:
At line 208 of file /app/schism/src/Core/scribe_io.F90
Fortran runtime error: Attempting to allocate already allocated variable 'iwork3'

Error termination. Backtrace:
At line 208 of file /app/schism/src/Core/scribe_io.F90
Fortran runtime error: Attempting to allocate already allocated variable 'iwork3'

Error termination. Backtrace:
#0  0x7f0d5bb94171 in ???
#1  0x7f0d5bb94d19 in ???
#2  0x7f0d5bb950fb in ???
#3  0x433de6 in ???
#4  0x404ed7 in ???
#5  0x404dd1 in ???
#6  0x405146 in ???
#7  0x405182 in ???
#8  0x7f0d5a189ca2 in ???
#9  0x404cfd in ???
#10  0xffffffffffffffff in ???
#0  0x7f196fa2e171 in ???
#1  0x7f196fa2ed19 in ???
#2  0x7f196fa2f0fb in ???
#0  0x7f28b0704171 in ???
#1  0x7f28b0704d19 in ???
#2  0x7f28b07050fb in ???
#3  0x433de6 in ???
#4  0x404ed7 in ???
#5  0x404dd1 in ???
.
.
.
#10  0xffffffffffffffff in ???
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
[a2f80da41728:00204] *** An error occurred in MPI_Send
[a2f80da41728:00204] *** reported by process [3351379969,2]
[a2f80da41728:00204] *** on communicator MPI COMMUNICATOR 3 DUP FROM 0
[a2f80da41728:00204] *** MPI_ERR_OTHER: known error not in list
[a2f80da41728:00204] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[a2f80da41728:00204] ***    and potentially your MPI job)
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[51138,1],7]
  Exit code:    2
--------------------------------------------------------------------------
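
Since every MPI rank prints the same message, the output above interleaves several copies of it. A small filter can collapse the repeats and show how many times each distinct runtime error was reported. This is only a sketch: the function name `count_runtime_errors` and the log file name in the usage comment are hypothetical.

```shell
# Read a run log on stdin and print "<count> <message>" for each distinct
# line containing "Fortran runtime error:".
# count_runtime_errors is a hypothetical helper name.
count_runtime_errors() {
    awk '/Fortran runtime error:/ { n[$0]++ }
         END { for (m in n) printf "%d %s\n", n[m], m }'
}

# Usage (schism_run.log is an assumed file name):
#   count_runtime_errors < schism_run.log
```

When more than one distinct message is present, awk does not guarantee the order of the output lines.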

@josephzhang8
Member

josephzhang8 commented Jun 17, 2022 via email

@SorooshMani-NOAA
Author

This is a new container (with different versions of gcc, netcdf, etc.) that gave me this new error. I haven't tested it with nws=0, but in my older container I did confirm that nws=0 ran through.

@SorooshMani-NOAA
Author

SorooshMani-NOAA commented Jun 17, 2022

With the new container from Takis, I get the SCHISM scribe_io error even when nws is equal to 0. I'm using commit e2b943a, but I think I saw the same thing with an earlier commit as well.

@josephzhang8
Member

josephzhang8 commented Jun 17, 2022 via email

@pvelissariou1
Contributor

pvelissariou1 commented Jul 19, 2022

Hi Joseph,

I updated the PaHM sources in SCHISM. You might want to check the parwind.F90 file to see if the modifications are OK.
I have now compiled and run SCHISM+PaHM using GFortran/OpenMPI without issues. It seems there were some memory
allocation conflicts with PaHM.

The flags I used to compile SCHISM with GFortran are:
cmake ../src/ -DCMAKE_Fortran_COMPILER=mpifort -DCMAKE_C_COMPILER=mpicc -DNetCDF_Fortran_LIBRARY=$(nc-config --libdir)/libnetcdff.so -DNetCDF_C_LIBRARY=$(nc-config --libdir)/libnetcdf.so -DNetCDF_INCLUDE_DIR=$(nc-config --includedir) -DUSE_PAHM=TRUE -DCMAKE_Fortran_FLAGS_RELEASE="-O2 -ffree-line-length-none -fallow-argument-mismatch"

For debugging I used:
cmake ../src/ -DCMAKE_Fortran_COMPILER=mpifort -DCMAKE_C_COMPILER=mpicc -DNetCDF_Fortran_LIBRARY=$(nc-config --libdir)/libnetcdff.so -DNetCDF_C_LIBRARY=$(nc-config --libdir)/libnetcdf.so -DNetCDF_INCLUDE_DIR=$(nc-config --includedir) -DUSE_PAHM=TRUE -DCMAKE_Fortran_FLAGS_RELEASE="-O2 -ggdb -ffree-line-length-none -fallow-argument-mismatch -fallow-invalid-boz -fbacktrace -fno-range-check -fno-unsafe-math-optimizations -frounding-math -fsignaling-nans -ffpe-trap=invalid,zero,overflow -mcmodel=medium -march=k8 -m64 -Wall -Wline-truncation -Wcharacter-truncation -Wsurprising -Waliasing"

NOTICE: If I use -O0 in debugging mode, GFortran issues a segmentation fault due to the creation of temporary arrays
in the call at line 3920 of src/Hydro/schism_init.F90. You might want to check this piece of code.
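
Both cmake invocations above build the NetCDF library paths from `nc-config --libdir`. A quick check that the two shared libraries really exist in that directory can save a failed configure step. This is a sketch under assumptions: the function name `check_netcdf_libs` is hypothetical, and on some systems the Fortran library (libnetcdff.so) may be installed in a different directory than the C library, in which case the path passed to cmake would need adjusting.

```shell
# Report whether libnetcdf.so and libnetcdff.so exist in the given directory.
# Returns 0 when both are present, 1 otherwise.
# check_netcdf_libs is a hypothetical helper name.
check_netcdf_libs() {
    libdir=$1
    status=0
    for lib in libnetcdf.so libnetcdff.so; do
        if [ -e "$libdir/$lib" ]; then
            echo "found $libdir/$lib"
        else
            echo "missing $libdir/$lib" >&2
            status=1
        fi
    done
    return "$status"
}

# Run the check only when nc-config is available on this machine.
if command -v nc-config >/dev/null 2>&1; then
    check_netcdf_libs "$(nc-config --libdir)" || echo "check the NetCDF paths before running cmake"
fi
```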

@josephzhang8
Member

josephzhang8 commented Jul 20, 2022 via email

@SorooshMani-NOAA
Author

Just for future reference, this ticket is going to be fixed by #75

@pvelissariou1
Contributor

pvelissariou1 commented Jul 20, 2022 via email

@josephzhang8
Member

josephzhang8 commented Jul 20, 2022 via email

@SorooshMani-NOAA
Author

@josephzhang8, I tested the updates in my container with the GNU compiler, and it no longer crashes while reading the hurricane track. From the testing perspective, I think we can move forward with merging this.

@josephzhang8
Member

josephzhang8 commented Jul 21, 2022 via email

@SorooshMani-NOAA
Author

shinnecock_sandy2012.zip
@josephzhang8 @pvelissariou1 this is a test case for the Shinnecock mesh with a parametric wind and tidal setup. I didn't validate anything since there are no stations here, but I know it runs successfully. It could potentially be included in the SCHISM regression tests with GNU compilers to make sure PaHM-SCHISM works.

@josephzhang8
Member

josephzhang8 commented Jul 22, 2022 via email

@SorooshMani-NOAA
Author

@pvelissariou1, do you have any suggestions on the best way to validate this run? There are no COOPS stations to compare against within the Shinnecock mesh.

@SorooshMani-NOAA
Author

I talked with @pvelissariou1. There used to be a station in that area for Hurricane Sandy. I'll update the setup with that station location, and we'll validate the water level based on it.

@SorooshMani-NOAA
Author

When I check the data inventory for the inlet station, it shows that data is available only for a brief period in late 2013 and early 2014. I need to ask around to see how to validate, given that the data is not available from the COOPS website.

@josephzhang8
Member

josephzhang8 commented Jul 25, 2022 via email

@pvelissariou1
Contributor

pvelissariou1 commented Oct 11, 2022 via email

@josephzhang8
Member

josephzhang8 commented Oct 11, 2022 via email
