MPI message loss issue in HPC Cluster #147
Comments
I am writing to ask some questions about CFD model results that differ depending on the mpi_hosts order. The current situation is that when nodes on different InfiniBand switches perform parallel computations, case #1 works well, but case #2 does not. Background: host01-host04 are on IB switch #1 and host99 is on IB switch #2. As far as I can guess (a hypothetical scenario with no theoretical basis), …
First of all, for debugging, I am putting print statements in several places to see which subroutine or function changes the value.
Additional information #1: are there any of these MPI options that could be improved or modified?
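As a first sanity check (this is only an illustrative standalone helper, not part of CMAQ, and the program name is made up), I think printing the rank-to-host mapping would confirm how the ranks are actually placed across the two IB switches for each mpi_hosts order:

```fortran
! rank_map.f90 - hypothetical standalone check, not part of CMAQ.
! Prints which host each MPI rank runs on, so the effect of the
! mpi_hosts ordering across the two InfiniBand switches can be seen.
program rank_map
   implicit none
   include 'mpif.h'
   integer :: ierr, mype, nprocs, namelen
   character( len = MPI_MAX_PROCESSOR_NAME ) :: hostname

   call mpi_init( ierr )
   call mpi_comm_rank( MPI_COMM_WORLD, mype, ierr )
   call mpi_comm_size( MPI_COMM_WORLD, nprocs, ierr )
   call mpi_get_processor_name( hostname, namelen, ierr )

   write( *, * ) 'rank', mype, 'of', nprocs, 'on', hostname( 1:namelen )

   call mpi_finalize( ierr )
end program rank_map
```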
Thank you for your question and your interest in the CMAQ system. We ask that you please post your question to the CMAS Center Forum: https://forum.cmascenter.org/ We would like this question to be documented on the forum to help other users that may run into similar issues. Please start a 'New Topic' with an informative title and choose 'CMAQ' as the category. This will ensure you are connected to the appropriate developer and user base. I will also pass your question on to the member of our team most familiar with these types of issues so that he can respond to your Forum post if he has any insight.
Hi kihangyoun,
Cheers,
Hi, All
I have some issues related to loss of message passing in an HPC cluster.
Here is my problem:
(I don't know the version of CMAQ that I used.)
In the y_ppm & y_yamo subroutines, the code uses SWAP2D & SWAP3D (in swap_sandia_routines.f).
But there is some loss of message passing between procs.
I have checked that the CMAQ version currently being deployed does not have swap_sandia, nor does it have y_yamo. It would be nice to get a new version and test it, but I want to solve the problem in the code that I have.
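To illustrate the pattern I mean (this is only a simplified sketch of a generic ghost-cell exchange, not the actual SWAP2D/SWAP3D code; the neighbor ranks, tag, and buffer sizes are made up), the kind of exchange where a missing wait or a mismatched tag/neighbor rank would look like a lost message is:

```fortran
! Simplified sketch of a generic 2-D ghost-cell exchange; NOT the real
! SWAP2D code. NORTH/SOUTH ranks, the tag, and buffer sizes are only
! illustrative.
subroutine swap2d_sketch( sendbuf, recvbuf, n, north, south, comm )
   implicit none
   include 'mpif.h'
   integer, intent( in )  :: n, north, south, comm
   real,    intent( in )  :: sendbuf( n )
   real,    intent( out ) :: recvbuf( n )
   integer :: req, ierr
   integer :: stat( MPI_STATUS_SIZE )
   integer, parameter :: tag = 100

   ! Post the receive before the matching send so the incoming message
   ! always has a landing place, even if the neighbor sends early.
   call mpi_irecv( recvbuf, n, MPI_REAL, north, tag, comm, req, ierr )
   call mpi_send ( sendbuf, n, MPI_REAL, south, tag, comm, ierr )

   ! If this wait is skipped, or the tag / neighbor rank does not match
   ! what the sender used, the data never arrives and the halo looks
   ! like a "lost" message.
   call mpi_wait( req, stat, ierr )
end subroutine swap2d_sketch
```

If the real routines follow this shape, the first things I plan to check are that every nonblocking receive has a matching wait and that the tags and neighbor ranks agree on both sides.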
Could you tell me if there has been an issue related to this?
Actually, I thought it was an MPI problem, so I asked Intel; there are more details in that posting.
If you want, check the address below, or I can paste the contents here.
https://software.intel.com/en-us/forums/intel-clusters-and-hpc-technology/topic/851369#comment-1955679