
MPI message loss issue in HPC Cluster #147

Closed

kihangyoun opened this issue Apr 2, 2020 · 3 comments

Comments

@kihangyoun

Hi all,

I am having an issue with lost MPI messages on an HPC cluster.
Here is my problem:
(I do not know the exact version of CMAQ that I am using.)
The y_ppm and y_yamo subroutines call SWAP2D and SWAP3D (in swap_sandia_routines.f),
but some of the messages passed between processes are lost.
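For context, the exchange in question is a ghost-row swap along the domain decomposition. The following is only a minimal, hypothetical sketch of that kind of pattern (the subroutine name and the use of MPI_Sendrecv are illustrative, not taken from swap_sandia_routines.f):

```fortran
! Hypothetical sketch of a north-south ghost-row swap; not the actual
! SWAP2D/SWAP3D implementation.
subroutine swap_ns_sketch( sendbuf, recvbuf, n, north, south )
  use mpi
  implicit none
  integer, intent(in)  :: n              ! number of reals in one ghost row
  integer, intent(in)  :: north, south   ! neighbor ranks (or MPI_PROC_NULL at a boundary)
  real,    intent(in)  :: sendbuf( n )
  real,    intent(out) :: recvbuf( n )
  integer :: ierr
  integer :: status( MPI_STATUS_SIZE )

  ! Send my boundary row to the northern neighbor while receiving the
  ! southern neighbor's boundary row in a single combined call.
  call MPI_Sendrecv( sendbuf, n, MPI_REAL, north, 100, &
                     recvbuf, n, MPI_REAL, south, 100, &
                     MPI_COMM_WORLD, status, ierr )
end subroutine swap_ns_sketch
```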

I have checked that the CMAQ version currently being distributed no longer includes swap_sandia or y_yamo. It would be nice to get a new version and test it, but I would like to solve the problem in the code that I have.
Could you tell me whether there has been a known issue related to this?

At first I thought it was an MPI problem, so I asked Intel; there are more details in that posting.
If you want, check the address below, or I can copy the contents here.

https://software.intel.com/en-us/forums/intel-clusters-and-hpc-technology/topic/851369#comment-1955679

@kihangyoun
Author

kihangyoun commented Apr 2, 2020

[Attached image: 20200402_mpi_error]
Q#1

I am writing to ask some questions about CFD model results that differ depending on the order of hosts in mpi_hosts.
I would like to hear your theoretical opinion, because the code is long and complex and it would be difficult to reproduce the problem with a small sample code.

The current situation is that when nodes on different InfiniBand switches run the parallel computation, case #1 works well but case #2 does not.
"Does not work well" means that the computed values differ.

Background: host01-host04 are on IB switch #1 and host99 is on IB switch #2.
Case #1: host01, host02, host03, host04, host99 (i.e. the head node is host01)
Case #2: host99, host01, host02, host03, host04 (i.e. the head node is host99)

My guesses so far (hypothetical scenarios with no theoretical basis):

  1. Communication goes wrong when the head node is on a different switch.
  2. The ranks returned by MPI_COMM_RANK get mixed up when it is called several times.
  3. Something is broken or mismatched in MPI_COMM_WORLD.
  4. Synchronization excludes the head node.

First of all, for debugging, I am putting print statements in several places to see which subroutine or function changes the values (a sketch of the tagging I use follows below).
(I will post more when the situation is updated.)
However, no matter which function I finally find, I am not sure the fix belongs at the code level, so I am posting to the forum to hear about similar experiences.
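For reference, this is the kind of rank- and host-tagged print I am inserting (the subroutine and argument names are illustrative, not from the CMAQ source):

```fortran
! Minimal sketch of a rank/host-tagged debug print (illustrative names).
subroutine debug_print_sketch( label, value )
  use mpi
  implicit none
  character(*), intent(in) :: label
  real,         intent(in) :: value
  integer :: myrank, ierr, namelen
  character( MPI_MAX_PROCESSOR_NAME ) :: hostname

  call MPI_Comm_rank( MPI_COMM_WORLD, myrank, ierr )
  call MPI_Get_processor_name( hostname, namelen, ierr )
  ! Tag every printed value with both the rank and the physical host, so it
  ! is easy to see whether the diverging values come from the node that sits
  ! on the other IB switch.
  write( *, '(A,I4,1X,A,1X,A,1X,ES16.8)' ) 'rank ', myrank, &
        hostname(1:namelen), label, value
end subroutine debug_print_sketch
```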

Additional #1

Additional information:

  1. This program uses only the MPI library, not OpenMP.
     I have not tried the constructs you recommend (IREQ, THREADPRIVATE, NOVECTOR).

  2. Test results
     As I said before, I tried two techniques (ISEND and BARRIER), but neither works (see the sketch after this list).
     a. ISEND: It runs, but the same message loss occurs.
     b. BARRIER: This one is a little strange. I am sure all the procs enter the subroutine, but then they go into an infinite wait (hang).
     c. SBUF, RBUF (Jim): I reduced the deallocations as much as possible, but the same message loss occurs.

  3. I do not think the subroutine itself is the problem, for two reasons:
     a. When the hosts are assigned within the same IB switch, no message has ever been lost in 20 repeats.
        ex)
        host004(IB1): 16 17 18 19 20
        host003(IB1): 11 12 13 14 15
        host002(IB1): 06 07 08 09 10
        host001(IB1): 01 02 03 04 05 : always fine
     b. When the East-West communication was kept on the same node by adjusting the domain decomposition (NROW, NCOL), there was no problem.
        (E-W communication uses the same subroutine.)
        It is always the South-North communication between different IB switches that causes the problem.
        ex)
        host037(IB2): 16 17 18 19 20 <- message loss occurs in S-N communication
        host003(IB1): 11 12 13 14 15
        host002(IB1): 06 07 08 09 10
        host001(IB1): 01 02 03 04 05

  4. Could the barrier hang have a similar cause? That is, are the nodes on one IB switch (host001-host004) waiting at the barrier for the node on the other IB switch (host037), while host037 has already passed through it?
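To be concrete about the ISEND test above, this is the shape of the pattern I tried (a minimal sketch with illustrative names, not the actual SWAP code). My understanding is that SBUF must not be reused or deallocated until the corresponding wait completes, which is why I also tried reducing the deallocations:

```fortran
! Minimal sketch of the non-blocking exchange pattern (illustrative names).
subroutine isend_sketch( sbuf, rbuf, n, north, south )
  use mpi
  implicit none
  integer, intent(in)  :: n, north, south
  real,    intent(in)  :: sbuf( n )
  real,    intent(out) :: rbuf( n )
  integer :: ierr, sreq, rreq
  integer :: status( MPI_STATUS_SIZE )

  ! Post the receive first, then the send, so neither side depends on
  ! buffering inside the MPI library.
  call MPI_Irecv( rbuf, n, MPI_REAL, south, 200, MPI_COMM_WORLD, rreq, ierr )
  call MPI_Isend( sbuf, n, MPI_REAL, north, 200, MPI_COMM_WORLD, sreq, ierr )

  ! Only after both waits return is it safe to read RBUF or to reuse /
  ! deallocate SBUF.
  call MPI_Wait( rreq, status, ierr )
  call MPI_Wait( sreq, status, ierr )
end subroutine isend_sketch
```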

Are there any MPI options that could be tuned to improve or work around this?
I am also going to check whether other MPI libraries (Open MPI, MVAPICH, MPICH) show the same error.
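As a first step in that check, a small stand-alone program like the one below (it assumes an MPI-3 library, which Intel MPI, Open MPI, MVAPICH2 and MPICH all provide) can confirm at runtime which MPI library the executable is actually linked against:

```fortran
! Small sketch (not CMAQ code) that prints the MPI library version string.
program which_mpi
  use mpi
  implicit none
  integer :: ierr, reslen, myrank
  character( MPI_MAX_LIBRARY_VERSION_STRING ) :: libver

  call MPI_Init( ierr )
  call MPI_Comm_rank( MPI_COMM_WORLD, myrank, ierr )
  call MPI_Get_library_version( libver, reslen, ierr )
  if ( myrank == 0 ) write( *, '(A)' ) libver(1:reslen)
  call MPI_Finalize( ierr )
end program which_mpi
```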

@kmfoley
Collaborator

kmfoley commented Apr 3, 2020

Thank you for your question and your interest in the CMAQ system. We ask that you please post your question to the CMAS Center Forum: https://forum.cmascenter.org/

We would like this question to be documented on the forum to help other users that may run into similar issues.

Please start a 'New Topic' with an informative title and choose 'CMAQ' as the category. This will ensure you are connected to the appropriate developer and user base. I will also pass your question on to the member of our team most familiar with these types of issues so that he can respond to your Forum post if he has any insight.

kmfoley closed this as completed Apr 3, 2020
@dwongepa
Contributor

dwongepa commented Apr 3, 2020

Hi kihangyoun,

Could you please contact me directly at wong.david-c@epa.gov? I would like to ask you a few more questions to determine the cause of the problem.

Cheers,
David
