Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clarify map API in concurrent.futures #76487

Closed
dlukes mannequin opened this issue Dec 13, 2017 · 9 comments
Closed

Clarify map API in concurrent.futures #76487

dlukes mannequin opened this issue Dec 13, 2017 · 9 comments
Labels
3.7 (EOL) end of life docs Documentation in the Doc dir stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error

Comments

@dlukes
Copy link
Mannequin

dlukes mannequin commented Dec 13, 2017

BPO 32306
Nosy @pitrou, @dlukes
PRs
  • bpo-32306: Clarify c.f.Executor.map() documentation #4947
  • [3.6] bpo-32306: Clarify c.f.Executor.map() documentation (GH-4947) #4948
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2017-12-20.18:19:42.516>
    created_at = <Date 2017-12-13.17:15:53.829>
    labels = ['3.7', 'type-bug', 'library', 'docs']
    title = 'Clarify map API in concurrent.futures'
    updated_at = <Date 2017-12-21.08:44:55.457>
    user = 'https://github.com/dlukes'

    bugs.python.org fields:

    activity = <Date 2017-12-21.08:44:55.457>
    actor = 'dlukes'
    assignee = 'docs@python'
    closed = True
    closed_date = <Date 2017-12-20.18:19:42.516>
    closer = 'pitrou'
    components = ['Documentation', 'Library (Lib)']
    creation = <Date 2017-12-13.17:15:53.829>
    creator = 'dlukes'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 32306
    keywords = ['patch']
    message_count = 9.0
    messages = ['308221', '308646', '308720', '308726', '308728', '308760', '308762', '308763', '308858']
    nosy_count = 3.0
    nosy_names = ['pitrou', 'docs@python', 'dlukes']
    pr_nums = ['4947', '4948']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue32306'
    versions = ['Python 3.6', 'Python 3.7']

    @dlukes
    Copy link
    Mannequin Author

    dlukes mannequin commented Dec 13, 2017

    The docstring for concurrent.futures.Executor.map starts by stating that it is "Equivalent to map(func, *iterables)". In the case of Python 3, I would argue this is true only superficially: with map, the user expects memory-efficient processing, i.e. that the entire resulting collection will not be held in memory at once unless s/he requests so e.g. with list(map(...)). (In Python 2, the expectations are different of course.) On the other hand, while Executor.map superficially returns a generator, which seems to align with this expectation, what happens behind the scenes is that the call blocks until all results are computed and only then starts yielding them. In other words, they have to be stored in memory all at once at some point.

    The lower-level multiprocessing module also describes multiprocessing.pool.Pool.map as "A parallel equivalent of the map() built-in function", but at least it immediately goes on to add that "It blocks until the result is ready.", which is a clear indication that all of the results will have to be stored somewhere before being yielded.

    I can think of several ways the situation could be improved, listed here from most conservative to most progressive:

    1. Add "It blocks until the result is ready." to the docstring of Executor.map as well, preferably somewhere at the beginning.
    2. Reword the docstrings of both Executor.map and Pool.map so that they don't describe the functions as "equivalent" to built-in map, which raises the wrong expectations. ("Similar to map(...), but blocks until all results are collected and only then yields them.")
    3. I would argue that the function that can be described as semantically equivalent to map is actually Pool.imap, which yields results as they're being computed. It would be really nice if this could be added to the higher-level futures API, along with Pool.imap_unordered. Executor.map simply doesn't work for very long streams of data.
    4. Maybe instead of adding imap and imap_unordered methods to Executor, it would be a good idea to change the signature of Executor.map(func, *iterables, timeout=None, chunksize=1) to Executor.map(func, *iterables, timeout=None, chunksize=1, block=True, ordered=True), in order to keep the API simple with good defaults while providing flexibility via keyword arguments.
    5. I would go so far as to say that for me personally, the block argument to the version of Executor.map proposed in Update Python Software Foundation Copyright Year. #4 above should be False by default, because that would make it behave most like built-in map, which is the least suprising behavior. But I've observed that for smaller work loads, imap tends to be slower than map, so I understand it might be a tradeoff between performance and semantics. Still, in a higher-level API meant for non-experts, I think semantics should be emphasized.

    If the latter options seem much too radical, please consider at least something along the lines of #1 above, I think it would help people correct their expectations when they first encounter the API :)

    @dlukes dlukes mannequin added type-feature A feature request or enhancement 3.8 only security fixes stdlib Python modules in the Lib dir labels Dec 13, 2017
    @pitrou
    Copy link
    Member

    pitrou commented Dec 19, 2017

    Hi David,

    what happens behind the scenes is that the call blocks until all results are computed and only then starts yielding them.

    The current implementation of the Executor.map() generator is:

            def result_iterator():
                try:
                    # reverse to keep finishing order
                    fs.reverse()
                    while fs:
                        # Careful not to keep a reference to the popped future
                        if timeout is None:
                            yield fs.pop().result()
                        else:
                            yield fs.pop().result(end_time - time.time())
                finally:
                    for future in fs:
                        future.cancel()

    So it seems to me that results are yielded as soon as they arrive (provided they arrive in the right order).

    @pitrou pitrou added docs Documentation in the Doc dir 3.7 (EOL) end of life and removed 3.8 only security fixes labels Dec 19, 2017
    @pitrou pitrou added type-bug An unexpected behavior, bug, or error and removed type-feature A feature request or enhancement labels Dec 19, 2017
    @dlukes
    Copy link
    Mannequin Author

    dlukes mannequin commented Dec 20, 2017

    Hi Antoine,

    Thanks for the response! :) I think the problem lies in the line immediately preceding the code you've posted:

    fs = [self.submit(fn, *args) for args in zip(*iterables)]
    

    In other words, all the jobs are first submitted and their futures stored in a list, which is then iterated over. This approach obviously breaks down when there is a great number of jobs, or when it's part of a pipeline meant for processing jobs continuously as they come.

    @pitrou
    Copy link
    Member

    pitrou commented Dec 20, 2017

    I see. So the problem you are pointing out is that the tasks *arguments* are consumed eagerly. I agree that may be a problem in some cases, though I think in most cases people are concerned with the *results*.

    (note that multiprocessing.Pool() has an imap() method which does what you would like)

    @dlukes
    Copy link
    Mannequin Author

    dlukes mannequin commented Dec 20, 2017

    Yes, sorry for not being quite clear the first time around :)

    I eventually found out about Pool.imap (see item 3 on list in OP) and indeed it fits my use case very nicely, but my point was that the documentation is somewhat misleading with respect to the semantics of built-in map() in Python 3.

    Specifically, I would argue that it is unexpected for a function which claims to be "Equivalent to map(func, *iterables)" to require allocating a list the length of the shortest iterable.

    Maybe a code example will make this clearer for potential newcomers to the discussion -- this is what I would expect to happen (= the behavior of built-in map() itself), yielding values from the iterable is interleaved with calls to the mapped function:

    >>> def gen():
    ...     for i in range(3):
    ...         print("yielding", i)
    ...         yield i
    ...         
    >>> def add1(i):
    ...     print("adding 1 to", i)
    ...     return i + 1
    ... 
    >>> list(map(add1, gen()))
    yielding 0
    adding 1 to 0
    yielding 1
    adding 1 to 1
    yielding 2
    adding 1 to 2
    [1, 2, 3]
    

    This is what happens instead with concurrent.futures.Executor.map():

    >>> def my_map(fn, iterable):
    ...     lst = list(iterable)
    ...     for i in lst:
    ...         yield fn(i)
    ...         
    >>> list(my_map(add1, gen()))
    yielding 0
    yielding 1
    yielding 2
    adding 1 to 0
    adding 1 to 1
    adding 1 to 2
    [1, 2, 3]
    

    @pitrou
    Copy link
    Member

    pitrou commented Dec 20, 2017

    New changeset a7a751d by Antoine Pitrou in branch 'master':
    bpo-32306: Clarify c.f.Executor.map() documentation (bpo-4947)
    a7a751d

    @pitrou
    Copy link
    Member

    pitrou commented Dec 20, 2017

    New changeset 4aa84e7 by Antoine Pitrou (Miss Islington (bot)) in branch '3.6':
    bpo-32306: Clarify c.f.Executor.map() documentation (GH-4947) (bpo-4948)
    4aa84e7

    @pitrou
    Copy link
    Member

    pitrou commented Dec 20, 2017

    I think the documentation is now clearer. Closing!

    @pitrou pitrou closed this as completed Dec 20, 2017
    @dlukes
    Copy link
    Mannequin Author

    dlukes mannequin commented Dec 21, 2017

    Perfect, thanks!

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.7 (EOL) end of life docs Documentation in the Doc dir stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error
    Projects
    None yet
    Development

    No branches or pull requests

    1 participant