Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

concurrent.futures.ProcessPoolExecutor.map() doesn't batch function arguments by chunks #55480

Closed
tbrink mannequin opened this issue Feb 21, 2011 · 19 comments
Closed

concurrent.futures.ProcessPoolExecutor.map() doesn't batch function arguments by chunks #55480

tbrink mannequin opened this issue Feb 21, 2011 · 19 comments
Labels
performance Performance or resource usage stdlib Python modules in the Lib dir

Comments

@tbrink
Copy link
Mannequin

tbrink mannequin commented Feb 21, 2011

BPO 11271
Nosy @brianquinlan, @pitrou, @MojoVampire
Files
  • map_comparison.py: Demonstration and workaround
  • new_processpoolexecutor.py: Improved the workaround
  • map_chunksize.patch: Adds chunksize parameter to ProcessPoolExecutor.map.
  • map_chunksize_with_test.patch
  • test_mult.py: benchmark script
  • map_chunksize_with_docs.patch: Updated patch with doc changes.
  • map_chunksize_docs_update.patch: More tests, small change to how we "yield from" the result chunks
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2014-10-04.18:28:06.450>
    created_at = <Date 2011-02-21.15:04:48.909>
    labels = ['library', 'performance']
    title = "concurrent.futures.ProcessPoolExecutor.map() doesn't batch function arguments by chunks"
    updated_at = <Date 2014-10-05.00:28:50.713>
    user = 'https://bugs.python.org/tbrink'

    bugs.python.org fields:

    activity = <Date 2014-10-05.00:28:50.713>
    actor = 'dan.oreilly'
    assignee = 'none'
    closed = True
    closed_date = <Date 2014-10-04.18:28:06.450>
    closer = 'pitrou'
    components = ['Library (Lib)']
    creation = <Date 2011-02-21.15:04:48.909>
    creator = 'tbrink'
    dependencies = []
    files = ['20825', '20826', '36067', '36184', '36185', '36306', '36352']
    hgrepos = []
    issue_num = 11271
    keywords = ['patch']
    message_count = 19.0
    messages = ['128963', '128970', '137351', '137359', '137692', '155114', '223802', '223803', '223855', '223966', '224369', '224372', '224451', '225045', '225143', '225214', '228472', '228473', '228514']
    nosy_count = 8.0
    nosy_names = ['bquinlan', 'pitrou', 'neologix', 'tbrink', 'python-dev', 'sbt', 'josh.r', 'dan.oreilly']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue11271'
    versions = ['Python 3.5']

    @tbrink
    Copy link
    Mannequin Author

    tbrink mannequin commented Feb 21, 2011

    I tested the new concurrent.futures.ProcessPoolExecutor.map() in 3.2 with the is_prime() function from the documentation example. This was significantly slower than using multiprocessing.Pool.map(). Quick look at the source showed that multiprocessing sends the iterable in chunks to the worker process while futures sends always only one entry of the iterable to the worker.

    Functions like is_prime() which finish relatively fast make the communication overhead (at least I guess that is the culprit) very big in comparison.

    Attached is a file which demonstrates the problem and a quick workaround. The workaround uses the chunk idea from multiprocessing. The problem is that it requires the iterables passed to map() to have a length and be indexable with a slice. I believe this limitation could be worked around.

    @tbrink tbrink mannequin added stdlib Python modules in the Lib dir performance Performance or resource usage labels Feb 21, 2011
    @tbrink
    Copy link
    Mannequin Author

    tbrink mannequin commented Feb 21, 2011

    Playing around a bit I wrote the attached implementation which works with all iterables.

    @brianquinlan
    Copy link
    Contributor

    brianquinlan commented May 31, 2011

    On my crappy computer, ProcessPoolExecutor.map adds <3ms of added execution time per call using your test framework.

    What is your use case where that is too much?

    That being said, I don't have any objections to making improvements.

    If you want to pursue this, could you attach a working map_comprison.py?

    @tbrink
    Copy link
    Mannequin Author

    tbrink mannequin commented May 31, 2011

    I can confirm an overhead of 2 ms to 3 ms using a relatively recent Intel Core i5 CPU.

    I (personally) believe these 3 ms to be a pretty big overhead on modern computers and I also believe that it would be relatively simple to reduce it.

    I don't have much time at the moment but I will try to produce a complete proof of concept patch for the futures module in the next weeks. I'd be happy to get some comments when it is finished.

    @pitrou
    Copy link
    Member

    pitrou commented Jun 5, 2011

    Using your test script fixed (on Python 3.3), I get the following numbers:

    Starting multiproc...done in 2.1014609336853027 s.
    Starting futures...done in 20.209479093551636 s.
    Starting futures "fixed"...done in 2.026125907897949 s.

    So there's a 0.2ms overhead per remote function call here (20/(100100000-100000000)).

    Can't your chunks() function use itertools.islice()?

    Also, the chunksize can't be anything else than 1 by default, since your approach is increasing latency of returning results.

    @pitrou pitrou changed the title concurrent.futures.ProcessPoolExecutor.map() slower than multiprocessing.Pool.map() for fast function argument concurrent.futures.ProcessPoolExecutor.map() doesn't batch function arguments by chunks Jun 5, 2011
    @brianquinlan
    Copy link
    Contributor

    brianquinlan commented Mar 7, 2012

    I'm closing this since tbrink didn't respond to pitrou's comments.

    @danoreilly
    Copy link
    Mannequin

    danoreilly mannequin commented Jul 24, 2014

    I'm seeing an even larger difference between multiprocessing.Pool and ProcessPoolExecutor on my machine, with Python 3.4:

    Starting multiproc...done in 2.160644769668579 s.
    Starting futures...done in 67.953957319259644 s.
    Starting futures "fixed"...done in 2.134932041168213 s.

    I've updated the initial patch to address the comments Antoine made; the chunksize now defaults to 1, and itertools is used to chunk the input iterables, rather than building a list. Attached is an updated benchmark script:

    Starting multiproc...done in 2.2295100688934326 s.
    Starting futures...done in 68.5991039276123 s.
    Starting futures "fixed" (no chunking)...done in 69.18992304801941 s.
    Starting futures "fixed" (with chunking)...done in 2.352942705154419 s.

    The new implementation of map has essentially identical performance to the original with chunksize=1, but it performs much better with a larger chunksize provided.

    @danoreilly
    Copy link
    Mannequin

    danoreilly mannequin commented Jul 24, 2014

    Here's a patch that adds the new map implementation from the benchmark script to concurrent.futures.process.

    @danoreilly
    Copy link
    Mannequin

    danoreilly mannequin commented Jul 24, 2014

    I've added new versions of the patch/demonstration that ensures we actually submit all the futures before trying to yield from the returned iterator. Previously we were just returning a generator object when map was called, without actually submitting any futures.

    @pitrou
    Copy link
    Member

    pitrou commented Jul 25, 2014

    Thank you for posting this, I'm reopening the issue.

    @pitrou pitrou reopened this Jul 25, 2014
    @pitrou
    Copy link
    Member

    pitrou commented Jul 31, 2014

    Thank you for the patch. I've posted some review comments. Two further remarks:

    • this makes the ProcessPool API slightly different from the ThreadPool one. I wonder if this is acceptable or not. Thoughts?

    • the patch would need unit tests for the additional functionality

    @danoreilly
    Copy link
    Mannequin

    danoreilly mannequin commented Jul 31, 2014

    re: Diverging ThreadPoolExecutor and ProcessPoolExecutor APIs. I thought about this as well. It would of course be possible to make ThreadPoolExecutor's API match, but it would serve no useful purpose that I can think of. I guess we could make the ThreadPoolExecutor API accept the chunksize argument just so the APIs match, but then not actually use it (and document that it's not used, of course). That would make it easier to switch between the two APIs, at least.

    On the other hand, the two APIs are slightly different already: ThreadPoolExecutor requires max_workers be supplied, while ProcessPoolExecutor will default to multiprocessing.cpu_count(). So this wouldn't completely be without precedent.

    Anyway, I will review your comments on the patch, add unit tests, and re-submit. Thanks!

    @danoreilly
    Copy link
    Mannequin

    danoreilly mannequin commented Jul 31, 2014

    I've attached an updated patch based on your review comments; there's now a unit test with a non-default chunksize, the chunking algorithm is more readable, and the same code path is used no matter what chunksize is provided.

    I've also updated the test script to match the implementation changes.

    @danoreilly
    Copy link
    Mannequin

    danoreilly mannequin commented Aug 7, 2014

    Here's an updated patch that adds documentation and Antoine's requested code changes.

    @pitrou
    Copy link
    Member

    pitrou commented Aug 10, 2014

    Thanks for the update! This looks basically good, I'll wait a bit to see if other people have comments, otherwise I'll commit.

    @danoreilly
    Copy link
    Mannequin

    danoreilly mannequin commented Aug 12, 2014

    A couple of small updates based on comments from Charles-François Natali:

    • Use itertools.chain.from_iterable to "yield from" the result chunks instead of a for loop.

    • Add more tests with varying chunksizes.

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Oct 4, 2014

    New changeset f87c2c4f03da by Antoine Pitrou in branch 'default':
    Issue bpo-11271: concurrent.futures.Executor.map() now takes a *chunksize*
    https://hg.python.org/cpython/rev/f87c2c4f03da

    @pitrou
    Copy link
    Member

    pitrou commented Oct 4, 2014

    Sorry for the delay. I simplified the patch a tiny bit ("yield from" wasn't needed, it was sufficient to result the itertools.chain.from_iterable() result), and committed it. Thank you!

    @pitrou pitrou closed this as completed Oct 4, 2014
    @danoreilly
    Copy link
    Mannequin

    danoreilly mannequin commented Oct 5, 2014

    Hey, my first committed patch :) Thanks for helping to push this through, Antoine!

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    performance Performance or resource usage stdlib Python modules in the Lib dir
    Projects
    None yet
    Development

    No branches or pull requests

    2 participants