
Enable ensemble_md to work with gmx_mpi #10

Merged
wehs7661 merged 9 commits into master from enable_gmx_mpi on Jun 2, 2023
Conversation

wehs7661 (Owner) commented May 2, 2023

The goal of this PR is to allow running EEXE simulations using MPI-enabled GROMACS. Here are some notes and thoughts relevant to this goal. Note that we refer to the original implementation as commit 4ffb70a, which is the commit right before this PR (i.e. right before this branch was created).

1. The use of mpi4py in the original implementation of EEXE

Here we summarize the purposes of using mpi4py in ensemble_EXE.py for the original implementation.

  • In the original implementation, mpi4py is imported in ensemble_EXE.py using from mpi4py import MPI. Then, global variables comm and rank are created by comm = MPI.COMM_WORLD and rank = comm.Get_rank() so that tasks can be assigned to different ranks. Specifically, most operations in ensemble_EXE.py use only one rank, namely rank 0, via a conditional statement (if rank == 0:). The only operations performed in parallel are the executions of the GROMACS grompp and mdrun commands, which are launched as subprocess calls under the conditional statement if rank < self.n_sim: (see the sketch after this list).
  • Notably, the original implementation only works for GROMACS with thread-MPI, not for MPI-enabled GROMACS. That is, the subprocess calls do not invoke mpirun or mpiexec. In fact, as discussed in the next section, calling mpirun or mpiexec via subprocess.run in code that imports mpi4py can cause issues.
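
To make this pattern concrete, here is a minimal sketch (not the actual ensemble_EXE.py code); the number of replicas, directory names, and file names are placeholders:

import subprocess
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

n_sim = 4  # hypothetical number of replicas

if rank == 0:
    # Bookkeeping (e.g. proposing swaps, logging) is done on rank 0 only.
    print('Preparing the next iteration ...')

if rank < n_sim:
    # Each of the first n_sim ranks launches one thread-MPI GROMACS run,
    # so the subprocess call does not involve mpirun/mpiexec.
    subprocess.run(['gmx', 'mdrun', '-deffnm', 'expanded'], cwd=f'sim_{rank}', check=True)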

2. The issue of nested MPI calls

To enable MPI-enabled GROMACS in the original implementation of EEXE, the most straightforward approach is probably to simply call MPI-enabled GROMACS (with mpiexec or mpirun) using subprocess.run. This approach was attempted in commit d2953ef (the first commit below). However, it did not work for the original implementation (commit 4ffb70a), which imported mpi4py using from mpi4py import MPI. To better understand this, here is an example (code.py) demonstrating the issue of calling mpirun via subprocess.run while mpi4py is imported:

import subprocess
from mpi4py import MPI
subprocess.run(['mpiexec', '-n', '1', 'ls'], capture_output=True, text=True, check=True)

Upon execution of python code.py, the following error would occur:

Traceback (most recent call last):
  File "/Users/Wei-TseHsu/Documents/Life_in_CU_Bouler/Research_in_Shirts_Lab/EEXE_experiments/Preliminary_tests/anthracene/test/code.py", line 3, in <module>
    subprocess.run(['mpiexec', '-n', '1', 'ls'], capture_output=True, text=True, check=True)
  File "/usr/local/Cellar/python@3.9/3.9.15/Frameworks/Python.framework/Versions/3.9/lib/python3.9/subprocess.py", line 528, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['mpiexec', '-n', '1', 'ls']' returned non-zero exit status 1.

On the other hand, with from mpi4py import MPI removed from the code, the code works fine with python code.py. This is because the mpi4py library, when imported, automatically initializes MPI using MPI_Init(), i.e., it creates an MPI environment. When subprocess.run(['mpiexec', '-n', '1', 'ls'], ...) is then called, the code tries to start another MPI environment within the one that has already been started by mpi4py (i.e. nested MPI calls), which causes an error like the one shown above, even if the command executed by mpiexec would not itself incur an error. (The error comes from calling mpiexec within an existing MPI environment.)
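
This automatic initialization is easy to verify; in a sketch like the one below (assuming a working mpi4py installation), MPI is already initialized immediately after the import:

from mpi4py import MPI

# Importing mpi4py with its default settings already calls MPI_Init(),
# so this prints True even though MPI.Init() was never called explicitly.
print(MPI.Is_initialized())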

Here is a relevant discussion about nested MPI calls, though it doesn't seem able to solve our issue here: https://stackoverflow.com/questions/39617250/nesting-mpi-calls-with-mpi4py

3. Possible workarounds

In this section, we discuss the different workarounds we tried for this issue.

3.1. Delay MPI initialization

Specifically, the code above would work if modified as below, which delays the initialization of MPI:

import subprocess
from mpi4py import rc

rc.initialize = False  # Delay MPI initialization

from mpi4py import MPI

subprocess.run(['mpiexec', '-n', '1', 'ls'], capture_output=True, text=True, check=True)

if not MPI.Is_initialized():
    MPI.Init()  # Initialize MPI when you actually need it

This workaround proposes using the same logic for the EEXE implementation. However, it was later found that this does not work: we still need the variable rank to run GROMACS commands in parallel with mpi4py, and MPI must be initialized before rank can be used. Also, once MPI is initialized, it cannot be "de-initialized" but only finalized (e.g. using MPI.Finalize()). Toggling MPI on and off with MPI.Init() and MPI.Finalize() would not make sense, since that is not what MPI is intended for, and the approach would just be awkward. The sketch below illustrates the conflict.
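
As an illustration (a sketch, not code from this PR): obtaining rank requires MPI to be initialized first, which reintroduces the nested-MPI problem from Section 2 as soon as mpiexec is launched:

import subprocess
from mpi4py import rc

rc.initialize = False  # Delay MPI initialization

from mpi4py import MPI

MPI.Init()  # required before the rank can be queried ...
rank = MPI.COMM_WORLD.Get_rank()

if rank == 0:
    # ... but MPI is now already running, so launching mpiexec here hits the
    # same nested-MPI failure described in Section 2.
    subprocess.run(['mpiexec', '-n', '1', 'ls'], capture_output=True, text=True, check=True)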

3.2. Spawn a new process using MPI.COMM_SELF.Spawn

Specifically, the following code is considered in this workaround, which assumes GROMACS to be MPI-compatible:

from mpi4py import MPI
comm = MPI.COMM_SELF.Spawn('/usr/local/bin/gmx', args=['-version'], maxprocs=1)

Notably, MPI.COMM_SELF.Spawn starts an MPI subprocess and creates a separate intercommunicator. This workaround proposes spawning new processes in the EEXE implementation. However, this workaround was also later found not to work in our implementation: the execution of the code simply hangs without crashing or raising any error, likely because the command (gmx -version) is not an MPI-aware operation. Since we still need to run GROMACS commands that are not designed to be parallelized (e.g. the grompp command), this approach is not suitable for our purpose.

3.3. Discard the use of mpi4py and launch GROMACS simulations with the flag -multidir

As discussed in the first section, we needed mpi4py to

  • Run GROMACS grompp commands in parallel
  • Run GROMACS mdrun commands in parallel

In our case, the different replicas of an EEXE simulation are run in different folders in parallel. This can actually be done with the -multidir flag of the GROMACS mdrun command, as documented here. Therefore, one possible workaround is to use -multidir to replace the use of mpi4py and to perform the GROMACS grompp commands serially, which hopefully does not introduce too much overhead compared to running the grompp commands in parallel. (There might be other ways to run grompp commands in parallel without using mpi4py, but we will explore that later.) Specifically, this workaround proposes using a subprocess call to run only one GROMACS mdrun command (instead of n, where n is the number of replicas) that runs all replicas in parallel, and launching the GROMACS grompp commands (also via subprocess calls) serially, as in the sketch below.
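
Here is a rough sketch of what this workaround looks like; the directory names, file names, and number of replicas are placeholders rather than the actual values used in ensemble_md:

import subprocess

n_sim = 4  # hypothetical number of replicas
dirs = [f'sim_{i}' for i in range(n_sim)]

# Run grompp serially, one subprocess call per replica directory.
for d in dirs:
    subprocess.run(['gmx_mpi', 'grompp', '-f', 'expanded.mdp', '-c', 'sys.gro',
                    '-p', 'sys.top', '-o', 'sys_EE.tpr'], cwd=d, check=True)

# Launch all replicas with a single mdrun command via -multidir,
# using one MPI rank per replica in this minimal example.
subprocess.run(['mpirun', '-np', str(n_sim), 'gmx_mpi', 'mdrun',
                '-deffnm', 'sys_EE', '-multidir'] + dirs, check=True)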

Notably, the flag -multidir is only available in MPI-enabled GROMACS, so with this workaround, the new implementation of EEXE is restricted to working only with MPI-enabled GROMACS. (It seems impossible to support both thread-MPI GROMACS and MPI-enabled GROMACS: we would need mpi4py to parallelize thread-MPI GROMACS runs, but importing mpi4py leads to nested MPI calls when MPI-enabled GROMACS is used, which makes the subprocess calls fail. As such, we need to disable the use of thread-MPI GROMACS in the new implementation of EEXE.) However, this is a reasonable/natural choice, since methods based on replica exchange do not work with thread-MPI GROMACS anyway, and EEXE is intended to be highly parallelized for complex systems.

4. Outcome

Workaround 3.3 (mainly implemented in commit 77415b6) successfully enabled MPI-enabled GROMACS in the new implementation of EEXE and disabled the use of thread-MPI GROMACS.

5. Checklist

  • Implement necessary algorithms to enable MPI-enabled GROMACS.
  • Update the unit tests and pass CI.
  • Update the documentation to reflect the changes.

@wehs7661 wehs7661 changed the title from "Enable ensemble_md to work with gmxapi" to "Enable ensemble_md to work with gmx_mpi" on May 2, 2023
@wehs7661 wehs7661 self-assigned this May 3, 2023
@wehs7661 wehs7661 added the enhancement and help wanted labels on May 3, 2023
@wehs7661 wehs7661 removed the help wanted label on Jun 2, 2023
@wehs7661 wehs7661 merged commit c661fb8 into master Jun 2, 2023
2 checks passed
@wehs7661 wehs7661 deleted the enable_gmx_mpi branch June 2, 2023 05:31