A problem about MPI #51

Closed
prlWanted opened this issue May 14, 2018 · 7 comments
Labels
needs-user-input the issue cannot be resolved without additional information

Comments

@prlWanted

Dear developers,
Thank you for your advanced code. Before installing SMILEI, I installed gcc-4.6.4, openmpi-1.10.2, hdf5-1.8.16 and python-2.7 as dependencies. SMILEI then installed successfully. I ran the namelist tst1d_0_em_propagation.py from the benchmarks directory as a test. While the code was trying to initialize the diagnostic fields, I got an MPI error message: "An error occurred in MPI_Comm_create_keyval reported by process [703004673,12] on communicator MPI_COMM_WORLD; MPI_ERR_ARG: invalid argument of some other kind". The submission script is as follows:
#!/bin/bash
#PBS -N smilei
#PBS -l nodes=1:ppn=16
#PBS -l walltime=550:50:00
#PBS -j oe
#PBS -q high
cd $PBS_O_WORKDIR
mpiexec -n 16 ./smilei benchmarks/tst1d_0_em_propagation.py
exit 0
It seems that SMILEI is based on a DSM supercomputer, but I use an SPM supercomputer. Is that the cause of the problem? How can I fix it? Thank you.

@mccoys
Contributor

mccoys commented May 14, 2018

Hello,

Could you attach your output and error logs?
That would make the error clearer.

@prlWanted
Author

Of course. Please find the output file in the attachment.

smilei.pdf

@jderouillat
Contributor

Hi,
We have never seen this kind of error. Also, as far as we can tell, the code does not call MPI_Comm_create_keyval directly.
Can you give us some additional information, such as a description of the supercomputer you are using? Does the code run successfully with a single MPI process and many OpenMP threads?
If it crashes in that case too, you could compile in debug mode (make config=debug) and run it within a debugger to identify the origin of the error.

@mccoys
Contributor

mccoys commented May 15, 2018

I don't think the error is related, but you have 16 processes with 8 threads each, for a total of 128 threads. This is far too many compared to the 16 patches available in the benchmark you chose.

Use 1 process with 16 threads or 2 processes with 8 threads instead.
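
The arithmetic behind this advice can be sketched in a few lines of Python. This is only an illustration: the function name check_layout is hypothetical, and the rule it encodes (the total thread count should not exceed the number of patches) is an assumption drawn from the comment above, not a documented SMILEI constraint.

```python
def check_layout(mpi_processes, omp_threads, patches):
    """Return (total_threads, ok).

    ok is True when the total thread count does not exceed the number
    of patches (assumed rule of thumb, see the lead-in).
    """
    total = mpi_processes * omp_threads
    return total, total <= patches


# The three layouts discussed in this thread, for a 16-patch benchmark.
for procs, threads in [(16, 8), (1, 16), (2, 8)]:
    total, ok = check_layout(procs, threads, patches=16)
    verdict = "ok" if ok else "too many threads"
    print(f"{procs} processes x {threads} threads = {total}: {verdict}")
```

With OpenMP, the number of threads per process is normally set through the OMP_NUM_THREADS environment variable before calling mpiexec.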

@prlWanted
Author

@jderouillat Sure. The supercomputer I am using consists of 44 Dawning CB85-G AMD blade machines. Each node (machine) has four eight-core 64-bit CPUs, which constitute 8 symmetric multiprocessors. The internal memory of each node is 8 GB. The CPUs are Opteron 6136 (2.4 GHz), each of which has 16 GB of internal memory. The supercomputer runs the RedHat Linux 5.0 operating system.

@mccoys I ran SMILEI with a single MPI process and 16 threads, and with 2 processes with 8 threads each, for 16 patches, but I still get the same error. I then ran the code in debug mode. The output file is below.
smilei.pdf

@jderouillat
Contributor

Could you run the debug version within gdb? Using only 1 MPI process and 1 OpenMP thread makes the output easier to read, and OpenMP doesn't matter in your case, since the simulation has not yet entered the OpenMP parallel section when it crashes:

$ mpirun -np 1 gdb --args .../smilei .../tst1d_00_em_propagation.py
(gdb) run
...
(gdb) backtrace
...

Then please send us the stack trace of the crash.
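
On a batch queue such as the PBS setup described earlier in this thread, an interactive gdb session may not be practical. As a rough sketch, the same run and backtrace steps can be passed to gdb non-interactively with its -batch and -ex options. The helper name gdb_backtrace_command is hypothetical, and the executable and namelist paths are simply the ones from the submission script; adjust them to your installation.

```python
import os
import shlex


def gdb_backtrace_command(executable, namelist, mpi_processes=1):
    """Build an mpirun command that runs gdb in batch mode.

    gdb executes 'run', then 'backtrace' once the program stops,
    and exits without waiting for interactive input.
    """
    return [
        "mpirun", "-np", str(mpi_processes),
        "gdb", "-batch", "-ex", "run", "-ex", "backtrace",
        "--args", executable, namelist,
    ]


cmd = gdb_backtrace_command("./smilei", "benchmarks/tst1d_0_em_propagation.py")

# Restrict OpenMP to a single thread, as suggested above.
env = dict(os.environ, OMP_NUM_THREADS="1")

# Print the command so it can be pasted into a job script.
print(" ".join(shlex.quote(part) for part in cmd))

# To launch it directly instead: subprocess.run(cmd, env=env)
```

The backtrace then ends up in the job's output file, which can be attached to the issue.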

@mccoys mccoys added the needs-user-input the issue cannot be resolved without additional information label Mar 5, 2019
@mccoys
Contributor

mccoys commented Mar 12, 2019

Closing as no more input from user

@mccoys mccoys closed this as completed Mar 12, 2019