
LammpsLibrary hangs on close #176

Open · pmrv opened this issue Feb 15, 2024 · 12 comments
Labels: bug (Something isn't working)

pmrv commented Feb 15, 2024

With the latest version, 0.2.13, interactive LAMMPS sessions work but no longer clean up properly. Running the snippet below never finishes (on the cmti cluster):

from pylammpsmpi import LammpsLibrary
import pylammpsmpi

lmp = LammpsLibrary(2)  # interactive LAMMPS session on two cores

lmp.version, pylammpsmpi.__version__

lmp.close()  # <- hangs indefinitely

I've watched the lmpmpi.py process with top, and it does disappear when close is called, but apparently that's not properly communicated back to the foreground process.

When I run this snippet on my laptop in a fresh conda environment, it hangs similarly, but also prints this warning:

[cmleo38:13075] mca_base_component_repository_open: unable to open mca_btl_openib: librdmacm.so.1: cannot open shared object file: No such file or directory (ignored)
[cmleo38:13074] mca_base_component_repository_open: unable to open mca_btl_openib: librdmacm.so.1: cannot open shared object file: No such file or directory (ignored)
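
For what it's worth, that warning only means OpenMPI is probing an InfiniBand component that is missing on a laptop. If it clutters the output, it can usually be silenced by excluding the component via a standard OpenMPI MCA environment variable; a minimal sketch, independent of pylammpsmpi:

import os

# Exclude the openib byte-transfer layer so OpenMPI stops probing for
# librdmacm; must be set before any MPI process is launched.
os.environ["OMPI_MCA_btl"] = "^openib"

As noted further down in the thread, though, the warning turned out to be a red herring for the hang itself.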
pmrv added the bug label on Feb 15, 2024
jan-janssen (Member) commented:

@pmrv It works for me, so I presume it is related to the setup on the cluster. @niklassiemer, can you comment on this?

pmrv (Author) commented Feb 15, 2024

It also doesn't work on a local machine for me. I suppose it might be MPI-related?

niklassiemer (Member) commented:

I do not have a clue...

pmrv (Author) commented Feb 15, 2024

So it does work on a local machine now. It seems the warning I posted above is just a red herring.

jan-janssen (Member) commented Feb 16, 2024

@pmrv Can you check whether interfacing with the MPI process directly fixes the issue?

import os
import pylammpsmpi
from pympipool.shared import interface_bootup, MpiExecInterface

# start the lmpmpi.py backend on two cores and talk to it directly
interface = interface_bootup(
    command_lst=["python", os.path.join(os.path.dirname(pylammpsmpi.__file__), "mpi/lmpmpi.py")],
    connections=MpiExecInterface(cwd=None, cores=2),
)
interface.send_and_receive_dict(input_dict={"command": "get_version", "args": []})
interface.shutdown(wait=True)
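
If that snippet blocks as well, a timeout on the blocking call helps distinguish a genuine hang from a merely slow shutdown; a minimal sketch using only the standard library (the 60 s bound is an arbitrary choice):

import threading

t = threading.Thread(target=interface.shutdown, kwargs={"wait": True}, daemon=True)
t.start()
t.join(timeout=60)  # give the backend a minute to exit cleanly
if t.is_alive():
    print("shutdown is still blocked after 60 s")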

pmrv (Author) commented Feb 16, 2024

It still hangs, but in a different location. I took this stack trace after ~30 min:

---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
Cell In[1], line 9
      4 interface = interface_bootup(
      5     command_lst=["python", os.path.join(os.path.dirname(pylammpsmpi.__file__), "mpi/lmpmpi.py")], 
      6     connections=MpiExecInterface(cwd=None, cores=2),
      7 )
      8 interface.send_and_receive_dict(input_dict={"command": "get_version", "args": []})
----> 9 interface.shutdown(wait=True)

File /cmmc/ptmp/pyironhb/mambaforge/envs/pyiron_latest/lib/python3.10/site-packages/pympipool/shared/communication.py:84, in SocketInterface.shutdown(self, wait)
     80 if self._interface.poll():
     81     result = self.send_and_receive_dict(
     82         input_dict={"shutdown": True, "wait": wait}
     83     )
---> 84     self._interface.shutdown(wait=wait)
     85 if self._socket is not None:
     86     self._socket.close()

File /cmmc/ptmp/pyironhb/mambaforge/envs/pyiron_latest/lib/python3.10/site-packages/pympipool/shared/interface.py:49, in SubprocessInterface.shutdown(self, wait)
     48 def shutdown(self, wait=True):
---> 49     self._process.communicate()
     50     self._process.terminate()
     51     if wait:

File /cmmc/ptmp/pyironhb/mambaforge/envs/pyiron_latest/lib/python3.10/subprocess.py:1146, in Popen.communicate(self, input, timeout)
   1144         stderr = self.stderr.read()
   1145         self.stderr.close()
-> 1146     self.wait()
   1147 else:
   1148     if timeout is not None:

File /cmmc/ptmp/pyironhb/mambaforge/envs/pyiron_latest/lib/python3.10/subprocess.py:1209, in Popen.wait(self, timeout)
   1207     endtime = _time() + timeout
   1208 try:
-> 1209     return self._wait(timeout=timeout)
   1210 except KeyboardInterrupt:
   1211     # https://bugs.python.org/issue25942
   1212     # The first keyboard interrupt waits briefly for the child to
   1213     # exit under the common assumption that it also received the ^C
   1214     # generated SIGINT and will exit rapidly.
   1215     if timeout is not None:

File /cmmc/ptmp/pyironhb/mambaforge/envs/pyiron_latest/lib/python3.10/subprocess.py:1959, in Popen._wait(self, timeout)
   1957 if self.returncode is not None:
   1958     break  # Another thread waited.
-> 1959 (pid, sts) = self._try_wait(0)
   1960 # Check the pid and loop as waitpid has been known to
   1961 # return 0 even without WNOHANG in odd situations.
   1962 # http://bugs.python.org/issue14396.
   1963 if pid == self.pid:

File /cmmc/ptmp/pyironhb/mambaforge/envs/pyiron_latest/lib/python3.10/subprocess.py:1917, in Popen._try_wait(self, wait_flags)
   1915 """All callers to this function MUST hold self._waitpid_lock."""
   1916 try:
-> 1917     (pid, sts) = os.waitpid(self.pid, wait_flags)
   1918 except ChildProcessError:
   1919     # This happens if SIGCLD is set to be ignored or waiting
   1920     # for child processes has otherwise been disabled for our
   1921     # process.  This child is dead, we can't get the status.
   1922     pid = self.pid

KeyboardInterrupt: 

Compared to the original stack trace:

---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
Cell In[24], line 1
----> 1 lmp.interactive_close()

File ~/software/pyiron_atomistics/pyiron_atomistics/lammps/interactive.py:581, in LammpsInteractive.interactive_close(self)
    579 def interactive_close(self):
    580     if self.interactive_is_activated():
--> 581         self._interactive_library.close()
    582         super(LammpsInteractive, self).interactive_close()
    583         with self.project_hdf5.open("output") as h5:

File /cmmc/ptmp/pyironhb/mambaforge/envs/pyiron_latest/lib/python3.10/site-packages/pylammpsmpi/wrapper/ase.py:356, in LammpsASELibrary.close(self)
    354 def close(self):
    355     if self._interactive_library is not None:
--> 356         self._interactive_library.close()

File /cmmc/ptmp/pyironhb/mambaforge/envs/pyiron_latest/lib/python3.10/site-packages/pylammpsmpi/wrapper/concurrent.py:652, in LammpsConcurrent.close(self)
    650 cancel_items_in_queue(que=self._future_queue)
    651 self._future_queue.put({"shutdown": True, "wait": True})
--> 652 self._process.join()
    653 self._process = None

File /cmmc/ptmp/pyironhb/mambaforge/envs/pyiron_latest/lib/python3.10/site-packages/pympipool/shared/thread.py:29, in RaisingThread.join(self, timeout)
     28 def join(self, timeout=None):
---> 29     super().join(timeout=timeout)
     30     if self._exception:
     31         raise self._exception

File /cmmc/ptmp/pyironhb/mambaforge/envs/pyiron_latest/lib/python3.10/threading.py:1096, in Thread.join(self, timeout)
   1093     raise RuntimeError("cannot join current thread")
   1095 if timeout is None:
-> 1096     self._wait_for_tstate_lock()
   1097 else:
   1098     # the behavior of a negative timeout isn't documented, but
   1099     # historically .join(timeout=x) for x<0 has acted as if timeout=0
   1100     self._wait_for_tstate_lock(timeout=max(timeout, 0))

File /cmmc/ptmp/pyironhb/mambaforge/envs/pyiron_latest/lib/python3.10/threading.py:1116, in Thread._wait_for_tstate_lock(self, block, timeout)
   1113     return
   1115 try:
-> 1116     if lock.acquire(block, timeout):
   1117         lock.release()
   1118         self._stop()

KeyboardInterrupt:
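
Both traces end in an unbounded wait: os.waitpid via Popen.communicate() in the first, a thread join in the second. Purely for illustration (plain subprocess, not the pympipool API), such a wait can be bounded like this:

import subprocess

proc = subprocess.Popen(["sleep", "1000"])  # stand-in for a stuck backend
try:
    proc.communicate(timeout=30)  # bounded wait instead of blocking forever
except subprocess.TimeoutExpired:
    proc.kill()         # force-terminate the unresponsive child
    proc.communicate()  # reap it so no zombie is left behind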

jan-janssen (Member) commented:

This might also be fixed by pyiron/executorlib#279

jan-janssen (Member) commented:

Another indication that this was caused by the issue in pympipool: it is now possible to discover the tests in pyiron_lammps (pyiron/pyiron_lammps#119), which contains tests where LAMMPS is executed in parallel. Previously, these tests did not close correctly when executed via unittest discover.

pmrv (Author) commented Feb 22, 2024

So with the latest changes from pympipool it works on the cluster in a Python shell, but not in a notebook/lab environment.

jan-janssen (Member) commented:

As discussed, running LAMMPS on multiple cores requires a LAMMPS build with MPI support. You can check which build is installed with conda list lammps, and you can force the installation of a build with OpenMPI support using conda install lammps=*=*openmpi* (see the commands below).
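
The same commands, with the package spec quoted so the shell does not expand the asterisks:

conda list lammps
conda install "lammps=*=*openmpi*"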

pmrv (Author) commented Feb 22, 2024

So with the correct LAMMPS build, the simple examples above seem to work, but interactive pyiron jobs and calphy are still stuck. I have to double-check all my versions and then update here.

pmrv (Author) commented Feb 29, 2024

So here's a small data point: @srmnitc and I managed to make it work on the cluster with the following env:

  - openmpi=4.1.6=hc5af2df_101
  - mpi4py=3.1.4=py311h4267d7f_1
  - pylammpsmpi=0.2.13=pyhc1e730c_0
  - pympipool=0.7.13=pyhd8ed1ab_0

and mpi4py=3.1.4 is apparently critical, because as soon as I upgraded it to 3.1.5 it stopped working again.

jan-janssen (Member) commented Mar 4, 2024

@pmrv Did you try the new pympipool version? Or are you still waiting for pyiron_atomistics to be compatible with the new pympipool version? The following combination of versions should work as well (one-line install shown after the list):

  - openmpi=4.1.6
  - mpi4py=3.1.5
  - pylammpsmpi=0.2.15
  - pympipool=0.7.17
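
For convenience, that combination can be installed in one call (version pins exactly as listed above; quoting guards against shell expansion):

conda install "openmpi=4.1.6" "mpi4py=3.1.5" "pylammpsmpi=0.2.15" "pympipool=0.7.17"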
