Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

concurrent.futures.ThreadPoolExecutor/ProcessPoolExecutor should accept an initializer argument #65622

Closed
andreasvc mannequin opened this issue May 3, 2014 · 30 comments
Closed
Labels
3.7 stdlib type-feature

Comments

@andreasvc
Copy link
Mannequin

@andreasvc andreasvc mannequin commented May 3, 2014

BPO 21423
Nosy @vsajip, @brianquinlan, @mdickinson, @pitrou, @giampaolo, @PCManticore, @serhiy-storchaka, @s0undt3ch, @thehesiod, @MojoVampire, @andreasvc, @nchammas
PRs
  • #4241
  • #4347
  • Files
  • pool_initializer.patch: add initializer and initargs keywords to both ThreadPoolExecutor and ProcessPoolExecutor
  • pool_initializer_tests.patch: add tests
  • pool_init.patch: Update patch to apply against latest, tweak exception handling
  • init_patch_updated.patch: Supress tracebacks in negative tests, fix race condition and a typo in ThreadPoolExecutor
  • init_patch_updated2.patch: Make ProcessPool/ThreadPool behavior when the initializer fails consistent.
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2017-11-09.14:51:39.458>
    created_at = <Date 2014-05-03.23:38:08.555>
    labels = ['3.7', 'type-feature', 'library']
    title = 'concurrent.futures.ThreadPoolExecutor/ProcessPoolExecutor should accept an initializer argument'
    updated_at = <Date 2017-11-09.14:59:25.865>
    user = 'https://github.com/andreasvc'

    bugs.python.org fields:

    activity = <Date 2017-11-09.14:59:25.865>
    actor = 'serhiy.storchaka'
    assignee = 'none'
    closed = True
    closed_date = <Date 2017-11-09.14:51:39.458>
    closer = 'pitrou'
    components = ['Library (Lib)']
    creation = <Date 2014-05-03.23:38:08.555>
    creator = 'andreasvc'
    dependencies = []
    files = ['35179', '35181', '36047', '36335', '36339']
    hgrepos = []
    issue_num = 21423
    keywords = ['patch']
    message_count = 30.0
    messages = ['217846', '218019', '218020', '218024', '218040', '218090', '218093', '218462', '218518', '223741', '225159', '225174', '225176', '225177', '239166', '261928', '279755', '299285', '305451', '305463', '305550', '305601', '305602', '305605', '305873', '305954', '305956', '305959', '305962', '305963']
    nosy_count = 16.0
    nosy_names = ['vinay.sajip', 'bquinlan', 'mark.dickinson', 'pitrou', 'giampaolo.rodola', 'Claudiu.Popa', 'jcon', 'serhiy.storchaka', 'mdengler', 's0undt3ch', 'thehesiod', 'josh.r', 'andreasvc', 'dan.oreilly', 'nchammas', 'Shiming He']
    pr_nums = ['4241', '4347']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue21423'
    versions = ['Python 3.7']

    @andreasvc
    Copy link
    Mannequin Author

    @andreasvc andreasvc mannequin commented May 3, 2014

    It would be useful if concurrent.futures.ThreadPoolExecutor took an initializer argument, like multiprocessing.Pool.

    This is useful for example to load a large dataset once upon initialization of each worker process, without have to pass the dataset as an argument with every job submission, which requires serialization.

    concurrent.futures has some advantages over multiprocessing such as detecting killed processes ( http://bugs.python.org/issue9205 ), so it would be good if the advantages of both can be combined.

    It appears that the following fork of concurrent.futures has added these arguments: https://github.com/enthought/encore/blob/7101984bc384da8e7975876ca2809cc0103c3efc/encore/concurrent/futures/enhanced_thread_pool_executor.py

    @andreasvc andreasvc mannequin added the stdlib label May 3, 2014
    @MojoVampire
    Copy link
    Mannequin

    @MojoVampire MojoVampire mannequin commented May 6, 2014

    Do you mean ProcessPoolExecutor? Thread based pools don't involve serialization.

    It would be good for ThreadPoolExecutor to have it as well, to make it easier to switch between executors, but only ProcessPoolExecutor is suffering from serialization overhead. Threads share the same memory space after all; references to data get passed directly, though you might choose to copy.copy or copy.deepcopy a root data "template" so each thread has its own unique copy that it can mutate.

    @andreasvc
    Copy link
    Mannequin Author

    @andreasvc andreasvc mannequin commented May 6, 2014

    Yes I did mean ProcessPoolExecutor, but indeed, it's good to have for threads as well.

    I could try to make a patch if it is likely that it would be accepted.

    @MojoVampire
    Copy link
    Mannequin

    @MojoVampire MojoVampire mannequin commented May 6, 2014

    I'm not a core developer, but writing the patch is usually considered helpful. Two notes:

    1. Make sure to write unit tests for any new behavior
    2. I'd suggest making any such argument keyword-only; if we move closer to the Java executor model, that means having a lot of options, the majority of which would be left as the default by users. Binding the API to a particular argument order is sub-optimal (it makes it even harder to deprecate arguments for instance), so enforcing keyword only behavior ensures users can't write call lines that take dependencies on argument ordering.

    @mdickinson
    Copy link
    Member

    @mdickinson mdickinson commented May 7, 2014

    BTW, I think there's a design mistake in the EnhancedThreadPoolExecutor that's worth avoiding in any std. lib. implementation: the initialiser and uninitialiser for the EnhancedThreadPoolExecutor accept no arguments. In retrospect, it would have been better to have them take the thread itself as a single argument. We often found ourselves needing this - it's not hard to work around with a threading.current_thread() call, but it's mildly annoying to have to do so.

    @andreasvc
    Copy link
    Mannequin Author

    @andreasvc andreasvc mannequin commented May 7, 2014

    Here's a patch. I have added initializer and initargs keywords to both ThreadPoolExecutor and ProcessPoolExecutor, with the same semantics as multiprocessing.Pool.

    I couldn't figure out what to do if the initializer fails with a ProcessPoolExecutor: how to properly send the traceback back? I also haven't gotten around to figure out how to write tests.
    I haven't added unitializers, don't know if they would be useful.

    @andreasvc
    Copy link
    Mannequin Author

    @andreasvc andreasvc mannequin commented May 8, 2014

    Here's a version with tests. Detecting an execption in the initializer works with ProcessPoolExecutor, but not with ThreadPoolExecutor.

    @giampaolo
    Copy link
    Contributor

    @giampaolo giampaolo commented May 13, 2014

    Related:
    https://groups.google.com/forum/#!topic/dev-python/ytbYwHXKC6o
    I'm not sure how what is proposed here would be useful for ThreadPoolExecutor but it would definitely be helpful being able to set an initializer for ProcessPoolExecutor because right now it seems it's impossible to cleanly shutdown the executor, see:
    http://noswap.com/blog/python-multiprocessing-keyboardinterrupt

    @andreasvc
    Copy link
    Mannequin Author

    @andreasvc andreasvc mannequin commented May 14, 2014

    Giampaolo, this patch is for ProcessPoolExecutor as well.

    About keyboard interrupts, if my tests are correct, they work
    in Python 3.3+ with both multiprocessing and concurrent.futures.
    (Although for the latter I have to hit ctrl-c twice).

    @danoreilly
    Copy link
    Mannequin

    @danoreilly danoreilly mannequin commented Jul 23, 2014

    It seems like everyone agrees that this functionality is useful, so I'm reviving this in hopes of getting a patch pushed through. I've updated Andreas' patch so that it applies cleanly against the latest tree, and tweaked the handling of exceptions in initializer. Now, ProcessPoolExecutor will raise a BrokenPoolException should an initializer method fail, and ThreadPoolExecutor will raise a RunTimeError stating that the pool can't be used because an initializer failed.

    I was hoping to use multiprocessing.Pool's handling of initializer exceptions as a guide for the right behavior here, but it actually does terrible job: an exception raised in the initializer is completely unhandled, and results in an endless loop of new processes being started up and immediately failing. But that's a separate bug report. :)

    For now there are still unit tests for testing exceptions being raised in the initializer, but they're noisy; the traceback for each initializer exception gets printed to stdout. I'm not sure if that's undesirable behavior or not.

    If the new behavior looks ok, the docs will need an update to.

    @danoreilly danoreilly mannequin changed the title concurrent.futures.ThreadPoolExecutor should accept an initializer argument concurrent.futures.ThreadPoolExecutor/ProcessPoolExecutor should accept an initializer argument Aug 10, 2014
    @danoreilly
    Copy link
    Mannequin

    @danoreilly danoreilly mannequin commented Aug 10, 2014

    Here's an updated patch. Changes:

    • Fixed what appears to have been a find/replace typo I made prior to uploading the previous patch.

    • The tracebacks from the negative unit tests are now suppressed.

    • Fixed a race condition in the initializer failure handling for ThreadPoolExecutor. Futures that were submitted before the initializer actually failed will now raise a RuntimeError indicating that initializer failed.

    • Suppressed an occasional queue.Full exception that would pop up while shutting down a ProcessPoolExecutor that was broken due to an initializer fail. As best as I can tell the exception is harmless, so suppressing it should be ok.

    *Updated the docs.

    @pitrou
    Copy link
    Member

    @pitrou pitrou commented Aug 10, 2014

    It seems the addition of the initargs argument doesn't tackle Mark's objection here:

    the initialiser and uninitialiser for the EnhancedThreadPoolExecutor
    accept no arguments. In retrospect, it would have been better to have
    them take the thread itself as a single argument.

    Regardless, I'm going to review the patch soon.

    @danoreilly
    Copy link
    Mannequin

    @danoreilly danoreilly mannequin commented Aug 11, 2014

    It seems the addition of the initargs argument doesn't tackle Mark's objection here:

    > the initialiser and uninitialiser for the EnhancedThreadPoolExecutor
    > accept no arguments. In retrospect, it would have been better to have
    > them take the thread itself as a single argument.

    This would be inconsistent with multiprocessing.Pool's initializer/initargs behavior. I'm not sure if consistency is important there or not, but its worth pointing out.

    I'm also not sure how useful it would be for ProcessPoolExecutor to receive an instance of itself. With multiprocessing.Pool, initializer is a commonly used to pass objects that can only be inherited to workers (Queues, Locks, etc.). The same pattern would be useful for ProcessPoolExecutor, which means initializer needs to be able to take an arbitrary number of arguments, rather than just an instance to the Process object itself.

    @danoreilly
    Copy link
    Mannequin

    @danoreilly danoreilly mannequin commented Aug 11, 2014

    Another updated patch. This one changes ProcessPoolExecutor behavior so that RuntimeErrors are raised in any active Futures, and on subsequent calls to submit after the initializer fails. This makes its behavior consistent with ThreadPoolExecutor.

    @danoreilly danoreilly mannequin added the type-feature label Mar 24, 2015
    @mdickinson
    Copy link
    Member

    @mdickinson mdickinson commented Mar 24, 2015

    I'm also not sure how useful it would be for ProcessPoolExecutor to receive an instance of itself.

    Agreed that this doesn't make much sense; I hadn't really thought about the ProcessPoolExecutor case. I withdraw my comments!

    @thehesiod
    Copy link
    Mannequin

    @thehesiod thehesiod mannequin commented Mar 17, 2016

    any chance if this getting into 3.5.2? I have some gross code to get around it (setting global properties)

    @s0undt3ch
    Copy link
    Mannequin

    @s0undt3ch s0undt3ch mannequin commented Oct 31, 2016

    Would also love to see this in the stdlib soon. My use case is logging setup(forward logs to the main process).

    @anntzer
    Copy link
    Mannequin

    @anntzer anntzer mannequin commented Jul 27, 2017

    For cross-referencing purposes: I have proposed in http://bugs.python.org/issue25293 to allow passing a Thread/Process subclass as argument instead of an initializer function, which would both handle Mark Dickinson's comment (http://bugs.python.org/issue21423#msg218040) about passing the thread object as argument, and also allow for finalization.

    @pitrou pitrou added the 3.7 label Nov 2, 2017
    @pitrou
    Copy link
    Member

    @pitrou pitrou commented Nov 2, 2017

    I've opened a PR for this at #4241, loosely based on Dan's last patch.

    @pitrou
    Copy link
    Member

    @pitrou pitrou commented Nov 3, 2017

    By the way, the submitted PR follows Dan's argument about the initializer's argument: the actual call is initializer(*initargs). If someone wants to know about the current thread or process, it's trivial to call thread.current_thread() or multiprocessing.current_process() (and I don't it's a bad idiom in itself :-)).

    If you want to comment on the PR, I would recommend doing it quick, as I plan to merge in a day or two :-)

    @pitrou
    Copy link
    Member

    @pitrou pitrou commented Nov 4, 2017

    New changeset 63ff413 by Antoine Pitrou in branch 'master':
    bpo-21423: Add an initializer argument to {Process,Thread}PoolExecutor (bpo-4241)
    63ff413

    @pitrou pitrou closed this as completed Nov 4, 2017
    @serhiy-storchaka
    Copy link
    Member

    @serhiy-storchaka serhiy-storchaka commented Nov 5, 2017

    test_concurrent_futures now produces too much output on stderr.

    $ ./python -m test test_concurrent_futures >/dev/null
    Exception in initializer:
    Traceback (most recent call last):
      File "/home/serhiy/py/cpython/Lib/concurrent/futures/process.py", line 170, in _process_worker
        initializer(*initargs)
      File "/home/serhiy/py/cpython/Lib/test/test_concurrent_futures.py", line 66, in init_fail
        raise ValueError('error in initializer')
    ValueError: error in initializer
    Exception in initializer:
    Traceback (most recent call last):
      File "/home/serhiy/py/cpython/Lib/concurrent/futures/process.py", line 170, in _process_worker
        initializer(*initargs)
      File "/home/serhiy/py/cpython/Lib/test/test_concurrent_futures.py", line 66, in init_fail
        raise ValueError('error in initializer')
    ValueError: error in initializer
    ...

    What is worse, this output looks as errors report.

    @serhiy-storchaka
    Copy link
    Member

    @serhiy-storchaka serhiy-storchaka commented Nov 5, 2017

    And please don't forget to edit a commit message when merge a PR.

    @pitrou
    Copy link
    Member

    @pitrou pitrou commented Nov 5, 2017

    Unfortunately, there is no obvious way to capture the output of the child process here, short of running the entire test under a subprocess.

    @serhiy-storchaka
    Copy link
    Member

    @serhiy-storchaka serhiy-storchaka commented Nov 8, 2017

    What is the best way to silence logging in subprocesses?

    @vsajip
    Copy link
    Member

    @vsajip vsajip commented Nov 9, 2017

    What is the best way to silence logging in subprocesses?

    Are you referring to the output shown in msg305601? If it's caused by logging statements, the best way would be either to pipe stderr to /dev/null or to change the logging to use sys.stdout (sys.stderr is just the default) and then pipe stdout to /dev/null.

    I hope I haven't misunderstood your question, but I fear I may have.

    @serhiy-storchaka
    Copy link
    Member

    @serhiy-storchaka serhiy-storchaka commented Nov 9, 2017

    I'm not well experienced with logging, but if we can change the logging in subprocesses, I think we could change it to not output messages at all. It would be better to save logging messages in subprocesses and check that expected logging messages are emitted in the main process. There is assertLogs(), but it works only with logging in the same process.

    @pitrou
    Copy link
    Member

    @pitrou pitrou commented Nov 9, 2017

    Serhiy, I think I have found a way to deal with the logging output:
    #4347

    @pitrou
    Copy link
    Member

    @pitrou pitrou commented Nov 9, 2017

    New changeset 0a2ff23 by Antoine Pitrou in branch 'master':
    Silence error output in test_concurrent_futures (bpo-21423) (bpo-4347)
    0a2ff23

    @pitrou pitrou closed this as completed Nov 9, 2017
    @serhiy-storchaka
    Copy link
    Member

    @serhiy-storchaka serhiy-storchaka commented Nov 9, 2017

    Thank you Antoine!

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.7 stdlib type-feature
    Projects
    None yet
    Development

    No branches or pull requests

    5 participants