Worker stall in multiprocessing.Pool #70094

Open
chroxvi mannequin opened this issue Dec 18, 2015 · 1 comment

Labels
topic-multiprocessing, type-bug (An unexpected behavior, bug, or error)

Comments

chroxvi mannequin commented Dec 18, 2015

BPO 25906
Nosy @applio

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

GitHub fields:

assignee = None
closed_at = None
created_at = <Date 2015-12-18.14:40:44.774>
labels = ['type-bug']
title = 'Worker stall in multiprocessing.Pool'
updated_at = <Date 2015-12-18.16:38:09.972>
user = 'https://bugs.python.org/chroxvi'

bugs.python.org fields:

activity = <Date 2015-12-18.16:38:09.972>
actor = 'r.david.murray'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = []
creation = <Date 2015-12-18.14:40:44.774>
creator = 'chroxvi'
dependencies = []
files = []
hgrepos = []
issue_num = 25906
keywords = []
message_count = 1.0
messages = ['256684']
nosy_count = 3.0
nosy_names = ['sbt', 'davin', 'chroxvi']
pr_nums = []
priority = 'normal'
resolution = None
stage = None
status = 'open'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue25906'
versions = ['Python 2.7', 'Python 3.3', 'Python 3.4', 'Python 3.5']

chroxvi mannequin commented Dec 18, 2015

I am experiencing some seemingly random stalls in my scientific simulations, which use a multiprocessing.Pool for parallelization. It has been incredibly difficult to come up with an example that consistently reproduces the problem; whether and when it occurs seems more or less random. The snippet below is my best shot at something that has a good chance of hitting the problem. I know it is unfortunate to have PyTables in the mix, but it is the only example I have been able to come up with that almost always hits the problem. I have been able to reproduce the problem (once!) by simply removing the with-statement (and thus PyTables) from the work function, but doing so (at least in my runs) makes the chance of hitting the problem almost vanish. Also, judging from the output of the script, the cause of the problem appears to lie in Python rather than in PyTables.

import os
import multiprocessing as mp
import tables

_hdf_db_name = 'join_crash_test.hdf'
_lock = mp.Lock()


class File():

    def __init__(self, *args, **kwargs):
        self._args = args
        self._kwargs = kwargs

        if len(args) > 0:
            self._filename = args[0]
        else:
            self._filename = kwargs['filename']

    def __enter__(self):
        _lock.acquire()
        self._file = tables.open_file(*self._args, **self._kwargs)
        return self._file

    def __exit__(self, type, value, traceback):
        self._file.close()
        _lock.release()


def work(task):
    worker_num, iteration = task

    with File(_hdf_db_name, mode='a') as h5_file:
        h5_file.create_array('/', 'a{}_{}'.format(worker_num, iteration),
                             obj=task)
    print('Worker {} finished writing to HDF table at iteration {}'.format(
        worker_num, iteration))

    return (worker_num, iteration)

iterations = 10
num_workers = 24
maxtasks = 1

if os.path.exists(_hdf_db_name):
    os.remove(_hdf_db_name)

for iteration in range(iterations):
    print('Now processing iteration: {}'.format(iteration))
    tasks = zip(range(num_workers), num_workers * [iteration])
    try:
        print('Spawning worker pool')
        workers = mp.Pool(num_workers, maxtasksperchild=maxtasks)
        print('Mapping tasks')
        results = workers.map(work, tasks, chunksize=1)
    finally:
        print('Cleaning up')
        workers.close()
        print('Workers closed - joining')
        workers.join()
        print('Process terminated')

In most of my test runs, this example stalls at "Workers closed - joining" in one of the iterations. Hitting Ctrl-C and inspecting the stack shows that the main process is waiting for a single worker that never stops executing. I have tested the example on various combinations of the operating systems and Python versions listed below.

Ubuntu 14.04.1 LTS
Ubuntu 14.04.3 LTS
ArchLinux (updated as of December 14, 2015)

Python 2.7.10 :: Anaconda 2.2.0 (64-bit)
Python 2.7.11 :: Anaconda 2.4.0 (64-bit)
Python 2.7.11 (Arch Linux 64-bit build)
Python 3.3.5 :: Anaconda 2.1.0 (64-bit)
Python 3.4.3 :: Anaconda 2.3.0 (64-bit)
Python 3.5.0 :: Anaconda 2.4.0 (64-bit)
Python 3.5.1 (Arch Linux 64-bit build)
Python 3.5.1 :: Anaconda 2.4.0 (64-bit)

It seems that some combinations are more likely to reproduce the problem than others. In particular, all the Python 3 builds reproduce the problem on almost every run, whereas I have not been able to reproduce it with the above example on any version of Python 2. I have, however, seen what appears to be the same problem in one of my simulations using Python 2.7.11: after 5 hours it stalled very close to the point of closing a Pool, and inspecting the HDF database holding the results showed that all but one of the 4000 tasks submitted to the Pool had finished. To me, this suggests that a single worker never finished executing.
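
To see where the stuck worker is blocked without killing the whole run, something like the following could be used (a minimal sketch, Python 3 and Unix only; the _init_faulthandler name, the stand-in work function, and the choice of SIGUSR1 are just illustrative, not part of the script above):

import faulthandler
import multiprocessing as mp
import signal


def _init_faulthandler():
    # Make each worker dump its current traceback to stderr when it
    # receives SIGUSR1, e.g. via `kill -USR1 <worker pid>`.
    faulthandler.register(signal.SIGUSR1)


def work(task):
    return task  # stand-in for the real work function above


if __name__ == '__main__':
    workers = mp.Pool(24, initializer=_init_faulthandler, maxtasksperchild=1)
    try:
        results = workers.map(work, range(24), chunksize=1)
    finally:
        workers.close()
        workers.join()

With that in place, sending SIGUSR1 to the worker that never returns should show whether it is stuck in the HDF5 write, in the lock, or somewhere inside Pool's own machinery.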

The problem I am describing here might very well be related to bpo-9205 as well as bpo-22393. However, I am not sure how to verify whether this is indeed the case.
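
One thing that might help narrow this down (sketch only; I have not run this variant): execute the same work function under concurrent.futures.ProcessPoolExecutor, which, as far as I understand, raises BrokenProcessPool since Python 3.3 when a worker process dies abruptly, instead of hanging the way multiprocessing.Pool can. It has no maxtasksperchild, so it is not a drop-in replacement for the script above, but it might help distinguish a worker that died (as in bpo-9205 / bpo-22393) from a worker that is alive but deadlocked:

import concurrent.futures
from concurrent.futures.process import BrokenProcessPool


def work(task):
    return task  # stand-in for the real work function above


if __name__ == '__main__':
    tasks = [(worker_num, 0) for worker_num in range(24)]
    try:
        with concurrent.futures.ProcessPoolExecutor(max_workers=24) as pool:
            # map() blocks here; a dead worker should surface as an exception
            results = list(pool.map(work, tasks))
    except BrokenProcessPool:
        print('A worker process died unexpectedly')

If this variant hangs in the same way, that would point away from a crashed worker and toward a genuine deadlock in the worker or in Pool's result handling.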

chroxvi (mannequin) added the type-bug (An unexpected behavior, bug, or error) label on Dec 18, 2015
ezio-melotti transferred this issue from another repository on Apr 10, 2022