Enable ensemble_md to work with gmx_mpi #10 (Merged)
The goal of this PR is to allow running EEXE simulations using MPI-enabled GROMACS. Here are some notes and thoughts relevant to this goal. Note that we refer to the original implementation as commit 4ffb70a, which is the commit right before this PR (i.e., right before this branch was created).
1. The use of `mpi4py` in the original implementation of EEXE

Here we summarize the purposes of using `mpi4py` in `ensemble_EXE.py` in the original implementation. `mpi4py` is imported in `ensemble_EXE.py` using `from mpi4py import MPI`. Then, the global variables `comm` and `rank` are created by `comm = MPI.COMM_WORLD` and `rank = comm.Get_rank()` so that tasks can be assigned to different ranks. Specifically, most operations in `ensemble_EXE.py` only use one rank and are assigned to rank 0 using a conditional statement (`if rank == 0:`). The only operations performed in parallel are the executions of the GROMACS `grompp` and `mdrun` commands, which are done by launching subprocess calls under the conditional statement `if rank < self.n_sim:`. Notably, these subprocess calls do not involve `mpirun` or `mpiexec`. In fact, as discussed in the next section, calling `mpirun` or `mpiexec` using `subprocess.run` in a code that imports `mpi4py` could cause issues.
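For reference, a minimal sketch of this rank-based pattern (not the actual `ensemble_EXE.py` code; `n_sim` and the GROMACS arguments are placeholders) looks like this:

```python
# Minimal sketch of the rank-based pattern described above.
import subprocess
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

n_sim = 4  # placeholder for self.n_sim, the number of replicas

if rank == 0:
    # Most operations (setup, bookkeeping, analysis) only run on rank 0.
    print('Preparing the EEXE simulation ...')

if rank < n_sim:
    # Each of the first n_sim ranks launches its own GROMACS subprocess,
    # e.g. the mdrun command for replica `rank` (arguments are placeholders).
    subprocess.run(['gmx', 'mdrun', '-deffnm', f'sim_{rank}/expanded'], check=True)
```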
2. The issue of nested MPI calls

To enable MPI-enabled GROMACS in the original implementation of EEXE, the most straightforward approach is probably to simply call MPI-enabled GROMACS (with `mpiexec` or `mpirun`) using `subprocess.run`. This approach was attempted in commit d2953ef (the first commit below). However, it did not work for the original implementation (commit 4ffb70a), which imported `mpi4py` using `from mpi4py import MPI`. To better understand this, here is an example (`code.py`) demonstrating the issue of calling `mpirun` using `subprocess.run` while `mpi4py` is imported in the code:
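The original snippet is not reproduced in this description; a minimal reconstruction, based on the `subprocess.run` call quoted later in this section, could look like this:

```python
# code.py: call mpiexec via subprocess.run while mpi4py is imported
import subprocess
from mpi4py import MPI  # importing mpi4py automatically calls MPI_Init()

# This tries to start a second MPI environment inside the one created by
# mpi4py, i.e. a nested MPI call, which fails.
subprocess.run(['mpiexec', '-n', '1', '/usr/local/bin/gmx', '-version'])
```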
Upon execution of `python code.py`, an error would occur. On the other hand, with `from mpi4py import MPI` removed from the code, the code should work fine with `python code.py`. This is because the `mpi4py` library, when imported, automatically initializes MPI using `MPI_Init()`, i.e., creates an MPI environment. Specifically, when `subprocess.run(['mpiexec', '-n', '1', '/usr/local/bin/gmx', '-version'])` is called, the code tries to start another MPI environment within the one that has already been started by `mpi4py` (i.e., nested MPI calls), which causes an error even if the command intended to be executed by `mpiexec` would not itself cause an error. (The error comes from calling `mpiexec` in an MPI environment.) Here is a relevant discussion about nested MPI calls, though it doesn't seem to solve our issue here: https://stackoverflow.com/questions/39617250/nesting-mpi-calls-with-mpi4py
3. Possible workarounds

In this section, we discuss the different workarounds we tried for this issue.
3.1. Delay MPI initialization

Specifically, the `code.py` example above would work if modified as below, which delays the initialization of MPI:
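The modified snippet is not shown in this description either; one way to delay the initialization (a sketch, using `mpi4py.rc.initialize` to disable the automatic `MPI_Init()` call) is:

```python
# code.py (modified): delay MPI initialization so that mpiexec can be
# called via subprocess.run without creating nested MPI environments.
import subprocess

import mpi4py
mpi4py.rc.initialize = False  # do not call MPI_Init() upon import
from mpi4py import MPI        # MPI is imported but not yet initialized

# No MPI environment has been started yet, so this call works.
subprocess.run(['mpiexec', '-n', '1', '/usr/local/bin/gmx', '-version'])
```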
This workaround proposes using the same logic in the EEXE implementation. However, it was later found that this does not work, since we still need the variable `rank` to be able to run GROMACS commands in parallel using `mpi4py`, but MPI must be initialized before `rank` can be used. Also, once MPI is initialized, it cannot be "de-initialized" but can only be finalized (e.g., using `MPI.Finalize()`). Toggling MPI on and off using `MPI.Init()` and `MPI.Finalize()` would not make sense, since this is not what MPI is intended for and this approach would just be awkward.

3.2. Spawn a new process using `MPI.COMM_SELF.Spawn`
Specifically, the following code is considered in this workaround, which assumes GROMACS to be MPI-compatible:
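The snippet itself is not included in this description; a minimal sketch of what it could look like (spawning one MPI process that runs `gmx -version`; the GROMACS path is just an example) is:

```python
# Spawn a new MPI process that runs `gmx -version` from a script that has
# already imported (and thus initialized) mpi4py. As discussed below, this
# approach was found to hang because `gmx -version` is not MPI-aware.
from mpi4py import MPI

child = MPI.COMM_SELF.Spawn('/usr/local/bin/gmx', args=['-version'], maxprocs=1)
child.Disconnect()  # close the intercommunicator to the spawned process
```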
Notably, `MPI.COMM_SELF.Spawn` starts an MPI subprocess, creating a separate intercommunicator. This workaround proposes spawning new processes in the EEXE implementation. However, this workaround was again later found not to work in our implementation, since the execution of the code would just hang without crashing with any error. This is likely because the command (`gmx -version`) is not an MPI-aware operation. Since we will still need to run GROMACS commands not designed to be parallelized (e.g., the `grompp` command), this approach is still not suitable for our purpose.

3.3. Discard the use of `mpi4py` and launch GROMACS simulations with the flag `-multidir`
As discussed in the first section, we needed `mpi4py` to
- run GROMACS `grompp` commands in parallel
- run GROMACS `mdrun` commands in parallel

In our case, the different replicas of an EEXE simulation are run in different folders in parallel. This can actually be done using the `-multidir` flag of the GROMACS `mdrun` command, as documented here. Therefore, one possible workaround is to use `-multidir` to replace the use of `mpi4py`, and to perform the GROMACS `grompp` commands serially, which hopefully would not introduce too much overhead compared to running the GROMACS `grompp` commands in parallel. (There might be other possible ways to run `grompp` commands in parallel without using `mpi4py`, but we will explore that later anyway.) Specifically, this workaround proposes using subprocess calls to launch only one GROMACS `mdrun` command (instead of `n`, where `n` is the number of replicas) to run multiple simulations in parallel, and to launch the GROMACS `grompp` commands (using subprocesses) serially.
Notably, the flag `-multidir` is only available in MPI-enabled GROMACS, so with this workaround, the new implementation of EEXE will be restricted to working only with MPI-enabled GROMACS. (Note that it seems impossible to support both thread-MPI GROMACS and MPI-enabled GROMACS, since we need `mpi4py` for thread-MPI GROMACS parallelization, but that would lead to nested MPI calls if MPI-enabled GROMACS is used, which always cause the subprocess calls to fail. As such, we need to disable the use of thread-MPI GROMACS in the new implementation of EEXE.) However, this is a reasonable/natural choice, since methods based on replica exchange do not work with thread-MPI GROMACS anyway, and EEXE is intended to be highly parallelized for complex systems.
4. Outcome

Workaround 3.3 (mainly implemented in commit 77415b6) successfully enabled the use of MPI-enabled GROMACS in the new implementation of EEXE and disabled the use of thread-MPI GROMACS.
5. Checklist