
Pytorch multiprocessing sync crash with python barrier. #29855

Closed
ailzhang opened this issue Nov 14, 2019 · 4 comments
Assignees
Labels
module: multiprocessing Related to torch.multiprocessing triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module

Comments

@ailzhang
Contributor

This is an issue originally found in the pytorch/xla repo by @jysohn23 and @dlibenzi.

import multiprocessing as mp
import os
import random
import time
import torch

def _mp_fn(rank, barrier):
    sleep_time = random.randint(0, 10)
    print('_mp_fn: rank={}, pid={}, sleep_time={}'.format(rank, os.getpid(), sleep_time), flush=True)
    time.sleep(sleep_time)
    barrier.wait()
    print('_mp_fn: wait complete rank={}, pid={}'.format(rank, os.getpid()), flush=True)

if __name__ == '__main__':
    num_procs = 8
    barrier = mp.Barrier(num_procs)
    torch.multiprocessing.spawn(_mp_fn, args=(barrier,), nprocs=num_procs)
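The underlying problem (my reading of the repro; the CPython link later in the thread points at the same root cause) is a start-method mismatch: `mp.Barrier` above comes from the default multiprocessing context, which is "fork" on Linux, while `torch.multiprocessing.spawn` starts workers with the "spawn" method by default, so the barrier's semaphore is not usable in the children. The mismatch can be seen without torch:

```python
import multiprocessing as mp

if __name__ == '__main__':
    # On Linux the default start method is "fork"; torch.multiprocessing.spawn
    # uses the "spawn" start method by default, so the Barrier above and the
    # spawned workers come from different start-method contexts.
    print(mp.get_start_method())  # "fork" on Linux
```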
@ailzhang ailzhang added the module: multiprocessing Related to torch.multiprocessing label Nov 14, 2019
@li-roy li-roy added the triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module label Nov 18, 2019
@ailzhang
Contributor Author

ailzhang commented Dec 5, 2019

@VitalyFedyunin any update here? ;)
cc: @jysohn23

@albanD
Collaborator

albanD commented Jul 7, 2023

The call seems to segfault when trying to acquire the lock from the barrier in the child process.
Running under gdb with set schedule-multiple on and set detach-on-fork off, you can get a stack trace from the child pointing at this:

(gdb) bt
#0  0x00007ffff7d7af2f in __new_sem_wait_fast (definitive_result=1, sem=0x7fffea256000 <_init>)
    at /usr/src/debug/glibc-2.36-9.fc37.x86_64/nptl/sem_waitcommon.c:141
#1  __new_sem_trywait (sem=0x7fffea256000 <_init>) at sem_wait.c:81
#2  0x00007fffea11e008 in _multiprocessing_SemLock_acquire_impl (self=self@entry=0x7fff3b9513f0, 
    blocking=blocking@entry=0, timeout_obj=None)
    at /home/albandes/local/installs/python3.11/debug/source/Modules/_multiprocessing/semaphore.c:342
#3  0x00007fffea11e2c5 in _multiprocessing_SemLock_acquire (self=0x7fff3b9513f0, args=<optimized out>, 
    args@entry=0x7ffff7fb5638, nargs=nargs@entry=1, kwnames=kwnames@entry=0x0)
    at /home/albandes/local/installs/python3.11/debug/source/Modules/_multiprocessing/clinic/semaphore.c.h:123
#4  0x00000000004e5913 in cfunction_vectorcall_FASTCALL_KEYWORDS (
    func=<built-in method acquire of _multiprocessing.SemLock object at remote 0x7fff3b9513f0>, args=0x7ffff7fb5638, 
    nargsf=<optimized out>, kwnames=0x0) at Objects/methodobject.c:443
#5  0x00000000004a2bcf in _PyObject_VectorcallTstate (tstate=0x8f6618 <_PyRuntime+166328>, 
    callable=callable@entry=<built-in method acquire of _multiprocessing.SemLock object at remote 0x7fff3b9513f0>, 
    args=args@entry=0x7ffff7fb5638, nargsf=9223372036854775809, kwnames=kwnames@entry=0x0)
    at ./Include/internal/pycore_call.h:92

And the corresponding python stack trace:

(gdb) py-bt
Traceback (most recent call first):
  <built-in method acquire of _multiprocessing.SemLock object at remote 0x7fff3b9513f0>
  File "/home/albandes/local/installs/python3.11/debug/install/lib/python3.11/multiprocessing/synchronize.py", line 272, in notify
    assert not self._wait_semaphore.acquire(
  File "/home/albandes/local/installs/python3.11/debug/install/lib/python3.11/multiprocessing/synchronize.py", line 297, in notify_all
    self.notify(n=sys.maxsize)
  File "/home/albandes/local/installs/python3.11/debug/install/lib/python3.11/threading.py", line 716, in _release
    self._cond.notify_all()
  File "/home/albandes/local/installs/python3.11/debug/install/lib/python3.11/threading.py", line 687, in wait
    self._release()
  File "/home/albandes/local/pytorch/3.11_debug_source/test/foo.py", line 13, in _mp_fn
    barrier.wait()
  File "/home/albandes/local/pytorch/3.11_debug_source/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/home/albandes/local/installs/python3.11/debug/install/lib/python3.11/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/albandes/local/installs/python3.11/debug/install/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/home/albandes/local/installs/python3.11/debug/install/lib/python3.11/multiprocessing/spawn.py", line 133, in _main
    return self._bootstrap(parent_sentinel)
  File "/home/albandes/local/installs/python3.11/debug/install/lib/python3.11/multiprocessing/spawn.py", line 120, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "<string>", line 1, in <module>

@albanD
Collaborator

albanD commented Jul 25, 2023

btw this is a known upstream problem in CPython: python/cpython#77377

The simplest pure-Python repro:

import multiprocessing as mp
import faulthandler

# Just to print the segfault in the child
faulthandler.enable()

def _mp_fn(barrier):
    barrier.wait()

if __name__ == '__main__':
    barrier = mp.get_context("fork").Barrier(1)
    p = mp.get_context("spawn").Process(target=_mp_fn, args=(barrier,))
    p.start()
    p.join()
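For completeness, a working variant (my sketch, not from the issue; the party count and start method are illustrative) creates the Barrier from the same start-method context that starts the process, so the underlying semaphore is valid in the child:

```python
import multiprocessing as mp

def _mp_fn(barrier):
    barrier.wait()

def run_with_matching_context(method):
    # Barrier and Process come from the SAME context, so the child
    # inherits a semaphore it can actually use.
    ctx = mp.get_context(method)
    barrier = ctx.Barrier(2)
    p = ctx.Process(target=_mp_fn, args=(barrier,))
    p.start()
    barrier.wait()  # the parent is the second party
    p.join()
    return p.exitcode

if __name__ == '__main__':
    # "fork" works too, as long as both objects share the context
    print(run_with_matching_context('spawn'))  # 0 on success
```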

@albanD
Collaborator

albanD commented Aug 23, 2023

This is now fixed in CPython and will raise a proper error on 3.11+ (once the bugfix patch is released)

@albanD albanD closed this as completed Aug 23, 2023